Despite significant progress in the theory and practice of program analysis, analyzing properties
of heap data has not reached the same level of maturity as the analysis of static and stack data. 
The spatial and temporal structure of stack and static data is well understood while that of heap 
data seems arbitrary and is unbounded. We devise bounded representations which summarize 
properties of the heap data. This summarization is based on the structure of the program which 
manipulates the heap. The resulting summary representations are certain kinds of graphs called 
access graphs. The boundedness of these representations and the monotonicity of the operations 
to manipulate them make it possible to compute them through data flow analysis. 

An important application which benefits from heap reference analysis is garbage collection, 
where currently liveness is conservatively approximated by reachability from program variables. 
As a consequence, current garbage collectors leave a lot of garbage uncollected, a fact which 
has been confirmed by several empirical studies. We propose the first ever end-to-end static 
analysis to distinguish live objects from reachable objects. We use this information to make dead 
objects unreachable by modifying the program. This application is interesting because it requires 
discovering data flow information representing complex semantics. In particular, we formulate 
the following new analyses for heap data: liveness, availability, and anticipability and propose 
solution methods for them. Together, they cover various combinations of directions of analysis 
(i.e. forward and backward) and confluence of information (i.e. union and intersection). Our 
analysis can also be used for plugging memory leaks in C/C++ languages.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors — Memory management 
(garbage collection); Optimization; F.3.2 [Logics and Meanings Of Programs]: Semantics of Programming 
Languages — Program analysis 

General Terms: Algorithms, Languages, Theory 

Additional Key Words and Phrases: Aliasing, Data Flow Analysis, Heap References, Liveness 



1. INTRODUCTION 

Conceptually, data in a program is allocated in either the static data area, stack, or heap. 
Despite significant progress in the theory and practice of program analysis, analyzing the 
properties of heap data has not reached the same level of maturity as the analysis of static 
and stack data. Section 1.2 investigates possible reasons. 

In order to facilitate a systematic analysis, we devise bounded representations which 
summarize properties of the heap data. This summarization is based on the structure of the 
program which manipulates the heap. The resulting summary representations are certain 
kinds of graphs, called access graphs, which are obtained through data flow analysis. We
believe that our technique of summarization is general enough to be also used in contexts 
other than heap reference analysis. 

1.1 Improving Garbage Collection through Heap Reference Analysis 

An important application which benefits from heap reference analysis is garbage collection,
where liveness of heap data is conservatively approximated by reachability. This amounts
to approximating the future of an execution with its past. Since current garbage collectors
cannot distinguish live data from data that is reachable but not live, they leave a lot of
garbage uncollected. This has been confirmed by empirical studies [Hirzel et al. 2002;
Hirzel et al. 2002; Shaham et al. 2000; 2001; 2002] which show that a large number (24%
to 76%) of heap objects which are reachable at a program point are actually not accessed
beyond that point. In order to collect such objects, we perform static analyses to make dead
objects unreachable by setting appropriate references to null. The idea that doing so would
facilitate better garbage collection is well known as "Cedar Mesa Folk Wisdom" [Gadbois
et al. ]. Empirical attempts at achieving this are reported in [Shaham et al. 2001; 2002].

    1. w = x                    // x points to m_a
    2. while (x.getdata() < max)
       {
    3.     x = x.rptr
       }
    4. y = x.lptr
    5. z = New class_of_z       // Possible GC Point
    6. y = y.lptr
    7. z.sum = x.lptr.getdata() + y.getdata()

(a) A Program Fragment

(b) Superimposition of memory graphs before line 5 (drawing of the stack and heap not
reproduced). Dashed arrows capture the effect of different iterations of the while loop.
All thick arrows (both dashed and solid) are live links.

       … = null
       w = x
       w = null
       while (x.getdata() < max)
       {
           x = x.rptr
           …
       }
       x.rptr = x.lptr.rptr = null
       x.lptr.lptr.lptr = null
       x.lptr.lptr.…
       y = x.lptr
       y.rptr = y.lptr.lptr = y.lptr.rptr = null
       z = New class_of_z
       z.lptr = z.rptr = null
       y = y.lptr
       x.lptr.lptr = y.lptr = y.rptr = null
       z.sum = x.lptr.getdata() + y.getdata()

(c) The modified program. Highlighted statements indicate the null assignments inserted in
the program using our method; an ellipsis (…) marks text that is illegible in this copy.
(More details in Section 4)

Fig. 1. A motivating example.

Garbage collection is an interesting application for us because it requires discovering 
data flow information representing complex semantics. In particular, we need to discover 
four properties of heap references: liveness, aliasing, availability, and anticipability. Live- 
ness captures references that may be used beyond the program point under consideration. 
Only the references that are not live can be considered for null assignments. Safety of null 
assignments further requires (a) discovering all possible ways of accessing a given heap 
memory cell (aliasing), and (b) ensuring that the reference being nullified is accessible 
(availability and anticipability). 

For simplicity of exposition, we present our method using a memory model similar to
that of Java. Extensions required for handling the C/C++ model of heap usage are easy and are
explained in Section 8. We assume that root variable references are on the stack and the
actual objects corresponding to the root variables are in the heap. In the rest of the paper we
ignore non-reference variables. We view the heap at a program point as a directed graph
called a memory graph. Root variables form the entry nodes of a memory graph. Other
nodes in the graph correspond to objects on the heap and edges correspond to references.
The out-edges of entry nodes are labeled by root variable names while out-edges of other
nodes are labeled by field names. The edges in the memory graph are called links.
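
The memory graph view can be made concrete with a small sketch. The following Python
fragment is only an illustration: the dictionary layout, the function name `reachable`, and
the particular nodes and links are assumptions made for this example and are not taken from
Figure 1. It shows what a reachability-based collector would retain, namely every node
reachable from a root variable, whether or not it is used later.

```python
# Hypothetical memory graph: names and shape are assumptions for illustration only.
roots = {"x": "m_a", "y": "m_i"}            # root variable -> heap object it references
heap = {                                    # heap object -> {field name: heap object}
    "m_a": {"lptr": "m_i", "rptr": "m_b"},
    "m_b": {"lptr": "m_f"},
    "m_i": {"lptr": "m_j"},
    "m_f": {},
    "m_j": {},
    "m_k": {},                              # no link leads to this object
}

def reachable(roots, heap):
    """Return the set of heap objects reachable from the root variables."""
    seen, stack = set(), list(roots.values())
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(heap[node].values())
    return seen

retained = reachable(roots, heap)
```

Reachability over-approximates liveness: every object in `retained` survives collection even
if the program never follows the links that lead to it again.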

EXAMPLE 1.1. Figure 1 shows a program fragment and its memory graphs before line
5. Depending upon the number of times the while loop is executed, x points to $m_a$, $m_b$, $m_c$
etc. Correspondingly, y points to $m_i$, $m_f$, $m_g$ etc. The call to New on line 5 may require
garbage collection. A conventional copying collector will preserve all nodes except $m_k$.
However, only a few of them are used beyond line 5.

The modified program is evidence of the strength of our approach. It makes the
unused nodes unreachable by nullifying relevant links. The modifications in the program
are general enough to nullify appropriate links for any number of iterations of the loop.
Observe that a null assignment has also been inserted within the loop body thereby making
some memory unreachable in each iteration of the loop. □

After such modifications, a garbage collector will collect a lot more garbage. Further, 
since copying collectors process only live data, garbage collection by such collectors will 
be faster. Both these facts are corroborated by our empirical measurements (Section 7). 

In the context of C/C++, instead of setting the references to null, allocated memory will 
have to be explicitly deallocated after checking that no alias is live. 

1.2 Difficulties in Analyzing Heap Data 

A program accesses data through expressions which have l-values and hence are called
access expressions. They can be scalar variables such as x, or may involve an array access
such as a[2 * i], or can be a reference expression such as x.l.r.

An important question that any program analysis has to answer is: Can an access expression
$\alpha_1$ at program point $p_1$ have the same l-value as $\alpha_2$ at program point $p_2$? Note that
the access expressions or program points could be identical. The precision of the analysis
depends on the precision of the answer to the above question.

When the access expressions are simple and correspond to scalar data, answering the
above question is often easy because the mapping of access expressions to l-values remains
fixed in a given scope throughout the execution of a program. However, in the case
of array or reference expressions, the mapping between an access expression and its l-value
is likely to change during execution. From now on, we shall limit our attention to reference
expressions, since these are the expressions that are primarily used to access the heap.
Observe that manipulation of the heap is nothing but changing the mapping between reference
expressions and their l-values. For example, in Figure 1, access expression x.lptr refers to
$m_i$ when the execution reaches line number 2 and may refer to $m_i$, $m_f$, $m_g$, or $m_e$ at line 4.

This implies that, subject to type compatibility, any access expression can correspond to 
any heap data, making it difficult to answer the question mentioned above. The problem 
is compounded because the program may contain loops implying that the same access 
expression appearing at the same program point may refer to different l-values at different
points of time. Besides, the heap data may contain cycles, causing an infinite number 
of access expressions to refer to the same l-value. All these make analysis of programs 
involving heaps difficult. 
1.3 Contributions of This Paper

The contributions of this paper fall in the following two categories:

— Contributions in Data Flow Analysis. We present a data flow framework in which the 
data flow values represent abstractions of heap. An interesting aspect of our method is 
the way we obtain bounded representations of the properties by using the structure of 
the program which manipulates the heap. As a consequence of this summarization, the 
values of data flow information constitute a complete lattice with finite height. Further, 
we have carefully identified a set of monotonic operations to manipulate this data flow 
information. Hence, the standard results of data flow analysis can be extended to heap 
reference analysis. Due to the generality of this approach, it can be applied to other 
analyses as well. 

— Contributions in Heap Data Analysis. We propose the first ever end-to-end solution 
(in the intraprocedural context) for statically discovering heap references which can be 
made null to improve garbage collection. The only approach which comes close to our 
approach is the heap safety automaton based approach [Shaham et al. 2003]. However, 
our approach is superior to their approach in terms of completeness, effectiveness, and 
efficiency (details in Section 9.2). 

The concept which unifies the contributions is the summarization of heap properties 
which uses the fact that the heap manipulations consist of repeating patterns which bear 
a close resemblance to the program structure. Our approach to summarization is more 
natural and more precise than other approaches because it does not depend on an a-priori 
bound [Jones and Muchnick 1979; 1982; Larus and Hilfinger 1988; Chase et al. 1990]. 

1.4 Organization of the paper 

The rest of the paper is organized as follows. Section 2 defines the concept of explicit 
liveness of heap objects and formulates a data flow analysis by using access graphs as data 
flow values. Section 3 defines other properties required for ensuring safety of null assign- 
ment insertion. Section 4 explains how null assignments are inserted. Section 5 discusses 
convergence and complexity issues. Section 6 shows the soundness of our approach. Sec- 
tion 7 presents empirical results. Section 8 extends the approach to C++. Section 9 reviews 
related work while Section 10 concludes the paper. 

2. EXPLICIT LIVENESS ANALYSIS OF HEAP REFERENCES 

Our method discovers live links at each program point, i.e., links which may be used in 
the program beyond the point under consideration. Links which are not live can be set to 
null. This section describes the liveness analysis. In particular, we define liveness of heap 
references, devise a bounded representation called an access graph for liveness, and then 
propose a data flow analysis for discovering liveness. Other analyses required for safety of 
null insertion are described in Section 3. 

Our method is flow sensitive but context insensitive. This means that we compute point- 
specific information in each procedure by taking into account the flow of control at the 
intraprocedural level and by approximating the interprocedural information such that it is 
not context-specific but is safe in all calling contexts. For the purpose of analysis, arrays 
are handled by approximating any occurrence of an array element by the entire array. The 
current version models exception handling by explicating possible control flows. However, 
programs containing threads are not covered. 
2.1 Access Paths

In order to discover liveness and other properties of heap, we need a way of naming links 
in the memory graph. We do this using access paths. 

An access path is a root variable name followed by a sequence of zero or more field
names and is denoted by $\rho_x$ = x->$f_1$->$f_2$->...->$f_k$. Since an access path represents a path
in a memory graph, it can be used for naming links and nodes. An access path consisting
of just a root variable name is called a simple access path; it represents a path consisting of
a single link corresponding to the root variable. $\mathcal{E}$ denotes an empty access path.

The last field name in an access path $\rho$ is called its frontier and is denoted by Frontier($\rho$).
The frontier of a simple access path is the root variable name. The access path corresponding
to the longest sequence of names in $\rho$ excluding its frontier is called its base and is
denoted by Base($\rho$). Base of a simple access path is the empty access path $\mathcal{E}$. The object
reached by traversing an access path $\rho$ is called the target of the access path and is denoted
by Target($\rho$). When we use an access path $\rho$ to refer to a link in a memory graph, it
denotes the last link in $\rho$, i.e. the link corresponding to Frontier($\rho$).
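
The definitions of Frontier, Base and Target can be sketched directly. The fragment below is
a minimal illustration under stated assumptions: an access path is modelled as a Python
tuple of names, the empty tuple stands for the empty access path, and the helper names
(`frontier`, `base`, `prefixes`, `target`) as well as the two-link memory graph are
hypothetical choices made for this example.

```python
def frontier(path):
    """Last name of the access path (the root variable for a simple path)."""
    return path[-1]

def base(path):
    """Access path without its frontier; () models the empty access path."""
    return path[:-1]

def prefixes(path):
    """All non-empty prefixes of the access path, including the path itself."""
    return [path[:i] for i in range(1, len(path) + 1)]

def target(path, roots, heap):
    """Object reached by traversing the access path in a memory graph."""
    node = roots[path[0]]
    for field in path[1:]:
        node = heap[node][field]
    return node

# Assumed memory graph: x -> m_a, m_a.lptr -> m_i, m_i.lptr -> m_j.
roots = {"x": "m_a"}
heap = {"m_a": {"lptr": "m_i"}, "m_i": {"lptr": "m_j"}, "m_j": {}}
rho = ("x", "lptr", "lptr")
```

With this memory graph, `target(rho, roots, heap)` yields `"m_j"`, and `base(("x",))` is the
empty tuple, mirroring the statement that the base of a simple access path is empty.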

Example 2.1. As explained earlier, Figure 1(b) is the superimposition of memory
graphs that can result before line 5 for different executions of the program. For the access
path $\rho_x$ = x->lptr->lptr, depending on whether the while loop is executed 0, 1, 2, or 3
times, Target($\rho_x$) denotes nodes $m_j$, $m_h$, $m_m$, or $m_l$. Frontier($\rho_x$) denotes one of the links
$m_i \to m_j$, $m_f \to m_h$, $m_g \to m_m$, or $m_e \to m_l$. Base($\rho_x$) represents the following paths in the
heap memory: $x \to m_a \to m_i$, $x \to m_b \to m_f$, $x \to m_c \to m_g$, or $x \to m_d \to m_e$. □

In the rest of the paper, $\alpha$ denotes an access expression, $\rho$ denotes an access path and
$\sigma$ denotes a (possibly empty) sequence of field names separated by ->. Let the access
expression $\alpha_x$ be x.$f_1$.$f_2$...$f_n$. Then, the corresponding access path $\rho_x$ is x->$f_1$->$f_2$->...->$f_n$.

When the root variable name is not required, we drop the subscripts from $\alpha_x$ and $\rho_x$.

2.2 Program Flow Graph 

Since the current version of our method involves context insensitive analysis, each procedure
is analyzed separately and only once. Thus there is no need to maintain a call
graph and we use the terms program and procedure interchangeably.

To simplify the description of analysis we make the following assumptions: 

— The program flow graph has a unique Entry and a unique Exit node. We assume that 
there is a distinguished main procedure. 

— Each statement forms a basic block. 

— The conditions that alter flow of control are made up only of simple variables. If not, the 
offending reference expression is assigned to a fresh simple variable before the condition 
and is replaced by the fresh variable in the condition. 

With these simplifications, each statement falls in one of the following categories:

— Function Calls. These are statements $\alpha_x = f(\alpha_y, \alpha_z, \ldots)$ where the functions involve access
expressions in arguments. The type of $\alpha_x$ does not matter.

— Assignment Statements. These are assignments to references and are denoted by $\alpha_x = \alpha_y$.
Only these statements can modify the structure of the heap.

— Use Statements. These statements use heap references to access heap data but do not
modify heap references. For the purpose of analysis, these statements are abstracted as
lists of expressions $\alpha_y.d$ where $\alpha_y$ is an access expression and d is a non-reference.

— Return Statement of the type return $\alpha_x$ involving reference variable x.

— Other Statements. These statements include all statements which do not refer to the 
heap. We ignore these statements since they do not influence heap reference analysis. 

When we talk about the execution path, we shall refer to the execution of the program 
derived by retaining all function calls, assignments and use statements and ignoring the 
condition checks in the path. 

For simplicity of exposition, we present the analyses assuming that there are no cycles 
in the heap. This assumption does not limit the theory in any way because our analyses 
inherently compute conservative information in the presence of cycles without requiring 
any special treatment. 

2.3 Liveness of Access Paths 

A link $l$ is live at a program point p if it is used in some control flow path starting from p.
Note that $l$ may be used in two different ways. It may be dereferenced to access an object
or tested for comparison. An erroneous nullification of $l$ would affect the two uses in
different ways: Dereferencing $l$ would result in an exception being raised whereas testing
$l$ for comparison may alter the result of the condition and thereby the execution path.

Figure 1(b) shows links that are live before line 5 by thick arrows. For a link $l$ to be live,
there must be at least one access path from some root variable to $l$ such that every link in
this path is live. This is the path that is actually traversed while using $l$.

Since our technique involves nullification of access paths, we need to extend the notion 
of liveness from links to access paths. An access path is defined to be live at p if the 
link corresponding to its frontier is live along some path starting at p. Safety of null 
assignments requires that the access paths which are live are excluded from nullification. 

We initially limit ourselves to a subset of live access paths, whose liveness can be deter- 
mined without taking into account the aliases created before p. These access paths are live 
solely because of the execution of the program beyond p. We call access paths that are
live in this sense explicitly live access paths. An interesting property of explicitly live
access paths is that they form the minimal set covering every live link.

Example 2.2. If the body of the while loop in Figure 1(a) is not executed even once,
Target(y) = $m_i$ at line 5 and the link $m_i \to m_j$ is live at line 5 because it is used in line 6.
The access paths y and y->lptr are explicitly live because their liveness at 5 can be determined
solely from the statements from 5 onwards. In contrast, the access path w->lptr->lptr
is live without being explicitly live. It becomes live because of the alias between y and
w->lptr and this alias was created before 5. Also note that if an access path is explicitly
live, so are all its prefixes. □

Example 2.3. We illustrate the issues in determining explicit liveness of access paths
by considering the assignment x.r.n = y.n.n.

— Killed Access Paths. Since the assignment modifies Frontier(x->r->n), any access path
which is live after the assignment and has x->r->n as prefix will cease to be live before
the assignment. Access paths that are live after the assignment and not killed by it are
live before the assignment also.

— Directly Generated Access Paths. All prefixes of x->r and y->n are explicitly live before
the assignment due to the local effect of the assignment.

— Transferred Access Paths. If x->r->n->$\sigma$ is live after the assignment, then y->n->n->$\sigma$ will
be live before the assignment. For example, if x->r->n->n is live after the assignment,
then y->n->n->n will be live before the assignment. The sequence of field names $\sigma$ is
viewed as being transferred from x->r->n to y->n->n. □
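
The three effects in Example 2.3 can be replayed on a finite set of access paths. The
following sketch assumes that access paths are tuples of names and that the set of paths
live after the assignment is finite; the function name `assignment_liveness` and the
sample set `live_after` are hypothetical. It illustrates the kill, direct and transfer
effects only and is not the access-graph-based analysis developed later.

```python
def prefixes(path):
    """All non-empty prefixes of an access path (the path itself included)."""
    return {path[:i] for i in range(1, len(path) + 1)}

def has_prefix(path, prefix):
    return path[:len(prefix)] == prefix

def assignment_liveness(lhs, rhs, live_after):
    """Paths live before `lhs = rhs`, given the paths live after it."""
    # Killed: every live path that has the assigned path as a prefix.
    kill = {p for p in live_after if has_prefix(p, lhs)}
    # Directly generated: prefixes of the bases of both sides.
    direct = prefixes(lhs[:-1]) | prefixes(rhs[:-1])
    # Transferred: the suffix following lhs is re-attached to rhs.
    transfer = {rhs + p[len(lhs):] for p in live_after if has_prefix(p, lhs)}
    return (live_after - kill) | direct | transfer

lhs = ("x", "r", "n")            # x.r.n
rhs = ("y", "n", "n")            # y.n.n
live_after = {("x",), ("x", "r"), ("x", "r", "n"), ("x", "r", "n", "n"), ("z",)}
live_before = assignment_liveness(lhs, rhs, live_after)
```

Here `("x", "r", "n", "n")` is killed and reappears as `("y", "n", "n", "n")`, while `("z",)` is
unaffected by the assignment and stays live.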

We now define liveness by generalizing the above observations. We use the notation
$\rho_x$->* to enumerate all access paths which have $\rho_x$ as a prefix. The summary liveness
information for a set S of reference variables is defined as follows:

$$ \mathit{Summary}(S) = \bigcup_{x \in S} \{\, x \texttt{->} * \,\} $$

Further, the set of all global variables is denoted by Globals and the set of formal parame- 
ters of the function being analyzed is denoted by Params. 

Definition 2.1. Explicit Liveness. The set of explicitly live access paths at a program
point p, denoted by $\mathit{Liveness}_p$, is defined as follows.

$$ \mathit{Liveness}_p = \bigcup_{\psi \in \mathit{Paths}(p)} \mathit{PathLiveness}_p^{\psi} $$

where $\psi \in \mathit{Paths}(p)$ is a control flow path from p to Exit and $\mathit{PathLiveness}_p^{\psi}$ denotes the liveness
at p along $\psi$ and is defined as follows. If p is not program exit then let the statement which
follows it be denoted by s and the program point immediately following s be denoted by
p'. Then,

$$ \mathit{PathLiveness}_p^{\psi} = \mathit{StatementLiveness}_s(\mathit{PathLiveness}_{p'}^{\psi}) $$

where the flow function for s is defined as follows:

$$ \mathit{StatementLiveness}_s(X) = (X - \mathit{LKill}_s) \cup \mathit{LDirect}_s \cup \mathit{LTransfer}_s(X) $$

$\mathit{LKill}_s$ denotes the set of access paths which cease to be live before statement s, $\mathit{LDirect}_s$
denotes the set of access paths which become live due to the local effect of s and $\mathit{LTransfer}_s(X)$
denotes the set of access paths which become live before s due to transfer of liveness
from live access paths after s. They are defined in Figure 2. □

Observe that the definitions of $\mathit{LKill}_s$, $\mathit{LDirect}_s$, and $\mathit{LTransfer}_s(X)$ ensure that
$\mathit{Liveness}_p$ is prefix-closed.

EXAMPLE 2.4. In Figure 1, it cannot be statically determined which link is represented
by access expression x.lptr at line 4. Depending upon the number of iterations of the while
loop, it may be any of the links represented by thick arrows. Thus at line 1, we have to assume
that all access paths {x->lptr->lptr, x->rptr->lptr->lptr, x->rptr->rptr->lptr->lptr, ...}
are explicitly live. □

In general, an infinite number of access paths with unbounded lengths may be live before 
a loop. Clearly, performing data flow analysis for access paths requires a suitable finite 
representation. Section 2.4 defines access graphs for the purpose. 
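
The growth of the set of live access paths around a loop can be observed with a few lines of
code. The sketch below is an assumption-laden illustration: it models only the loop body
x = x.rptr, takes {x, x->lptr, x->lptr->lptr} as the paths live at the loop exit, and uses
the hypothetical helper `step` for the backward effect of the loop body. Each round of the
fixed point iteration adds a strictly longer path, so the iteration over explicit sets
never stabilizes.

```python
def step(live):
    """Paths live before `x = x.rptr`, given the paths live after it."""
    # Paths rooted at x are killed and their suffixes transferred to x.rptr.
    transferred = {("x", "rptr") + p[1:] for p in live if p[0] == "x"}
    others = {p for p in live if p[0] != "x"}
    # x itself is live because it is dereferenced by the assignment.
    return others | {("x",)} | transferred

exit_live = {("x",), ("x", "lptr"), ("x", "lptr", "lptr")}

live = set(exit_live)
longest = []
for _ in range(4):
    # At the loop head, merge the loop-exit information with the loop-body information.
    live = exit_live | step(live)
    longest.append(max(len(p) for p in live))
```

After four rounds `longest` is `[4, 5, 6, 7]`: the longest live path grows by one field per
round, which is why a bounded representation is needed.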
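Statement s              | LKill_s          | LDirect_s                                                               | LTransfer_s(X)
-------------------------+------------------+-------------------------------------------------------------------------+-----------------------------------------------
$\alpha_x = \alpha_y$    | {$\rho_x$->*}    | Prefixes(Base($\rho_x$)) $\cup$ Prefixes(Base($\rho_y$))                | {$\rho_y$->$\sigma$ | $\rho_x$->$\sigma \in X$}
$\alpha_x = f(\alpha_y)$ | {$\rho_x$->*}    | Prefixes(Base($\rho_x$)) $\cup$ Summary({$\rho_y$} $\cup$ Globals)      | $\emptyset$
$\alpha_x$ = new         | {$\rho_x$->*}    | Prefixes(Base($\rho_x$))                                                | $\emptyset$
$\alpha_x$ = null        | {$\rho_x$->*}    | Prefixes(Base($\rho_x$))                                                | $\emptyset$
use $\alpha_y.d$         | $\emptyset$      | Prefixes($\rho_y$)                                                      | $\emptyset$
return $\alpha_y$        | $\emptyset$      | Summary({$\rho_y$})                                                     | $\emptyset$
other                    | $\emptyset$      | $\emptyset$                                                             | $\emptyset$

Fig. 2. Defining Flow Functions for Liveness. Globals denotes the set of global references and Params denotes
the set of formal parameters. For simplicity, we have shown a single access expression on the RHS.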



2.4 Representing Sets of Access Paths by Access Graphs 

In the presence of loops, the set of access paths may be infinite and the lengths of access 
paths may be unbounded. If the algorithm for analysis tries to compute sets of access paths 
explicitly, termination cannot be guaranteed. We solve this problem by representing a set 
of access paths by a graph of bounded size. 

2.4.1 Defining Access Graphs. An access graph, denoted by $G_v$, is a directed graph
$(n_0, N, E)$ representing a set of access paths starting from a root variable $v$.¹ $N$ is the set of
nodes, $n_0 \in N$ is the entry node with no in-edges and $E$ is the set of edges. Every path in
the graph represents an access path. The empty graph $\mathcal{E}_G$ has no nodes or edges and does
not accept any access path.

The entry node of an access graph is labeled with the name of the root variable while
the non-entry nodes are labeled with a unique label created as follows: If a field name $f$
is referenced in basic block $b$, we create an access graph node with a label $(f, b, i)$ where $i$
is the instance number used for distinguishing multiple occurrences of the field name $f$ in
block $b$. Note that this implies that the nodes with the same label are treated as identical.
Often, $i$ is 0 and in such a case we denote the label $(f, b, 0)$ by $f_b$ for brevity. Access paths
$\rho_x$->* are represented by including a summary node denoted $n_*$ with a self loop over it. It
is distinct from all other nodes but matches the field name of any other node.

A node in the access graph represents one or more links in the memory graph. Additionally,
during analysis, it represents a state of access graph construction (explained in
Section 2.4.2). An edge $f_n \to g_m$ in an access graph at program point p indicates that a
link corresponding to field $f$ dereferenced in block $n$ may be used to dereference a link
corresponding to field $g$ in block $m$ on some path starting at p. This has been used in
Section 5.2 to argue that the size of access graphs in practical programs is small.

Pictorially, the entry node of an access graph is indicated by an incoming double arrow. 
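
One possible concrete encoding of an access graph is sketched below. It is an illustration
under assumptions, not the representation used by the authors: the class name
`AccessGraph`, the edge-set encoding and the method `accepts` are hypothetical, node
labels are tuples (field, block, instance), and the summary node is left out for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class AccessGraph:
    root: str                                   # label of the entry node
    edges: set = field(default_factory=set)     # set of (source label, target label)

    def accepts(self, path):
        """True if `path` (a tuple of names) is represented by some path in the graph."""
        if not path or path[0] != self.root:
            return False
        current = {self.root}
        for name in path[1:]:
            # Follow every edge whose target node carries the next field name.
            current = {dst for (src, dst) in self.edges
                       if src in current and dst[0] == name}
            if not current:
                return False
        return True

R1 = ("r", 1, 0)                                # field r referenced in block 1, instance 0
cyclic = AccessGraph("x", {("x", R1), (R1, R1)})
```

Because of the self loop on `R1`, the graph `cyclic` accepts x, x->r, x->r->r and every
longer path of this shape, although it has only two nodes.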

2.4.2 Summarization. Recall that a link is live at a program point p if it is used along
some control flow path from p to Exit. Since different access paths may be live along
different control flow paths and there may be infinitely many control flow paths in the case
of a loop following p, there may be infinitely many access paths which are live at p. Hence,
the lengths of access paths will be unbounded. In such a case summarization is required.

¹ Where the root variable name is not required, we drop the subscript $v$ from $G_v$.

Live access paths at entry of block 1: {x, x->r, x->r->r, x->r->r->r, ...}
Corresponding access graph $G_x^1$: entry node x with an edge to $r_1$, and a self loop on $r_1$.

Live access paths at entry of block 1: {x, x->r, x->r->r}
Corresponding access graph $G_x^2$: entry node x with an edge to $r_1$, and an edge from $r_1$ to $r_2$.

Fig. 3. Approximations in Access Graphs (program fragments and drawings not reproduced).

Summarization is achieved by merging appropriate nodes in access graphs, retaining all 
in and out edges of merged nodes. We explain merging with the help of Figure 3: 

— Node $r_1$ in access graph $G_x^1$ indicates references of r at different execution instances
of the same program point. Every time this program point is visited during analysis,
the same state is reached in that the pattern of references after $r_1$ is repeated. Thus all
occurrences of $r_1$ are merged into a single state. This creates a cycle which captures the
repeating pattern of references.

— In $G_x^2$, nodes $r_1$ and $r_2$ indicate referencing r at different program points. Since the
references made after these program points may be different, $r_1$ and $r_2$ are not merged.

Summarization captures the pattern of heap traversal in the most straightforward way. 
Traversing a path in the heap requires the presence of reference assignments a x = a y such 
that p x is a proper prefix of p y . Assignments in Figure 3 are examples of such assignments. 
The structure of the flow of control between such assignments in a program determines the 
pattern of heap traversal. Summarization captures this pattern without the need of control 
flow analysis and the resulting structure is reflected in the access graphs as can be seen 
in Figure 3. More examples of the resemblance of program structure and access graph 
structure can be seen in the access graphs in Figure 6. 
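
The merging of nodes with identical labels can be illustrated as follows. The sketch is a
deliberate simplification and an assumption of this example: instead of running the data
flow analysis, the hypothetical function `summarize` takes the sequence of (field, block)
references met along one unrolled execution and adds one edge per consecutive pair. Because
a label depends only on the field and the program point, repeated visits to the same
statement reuse the same node.

```python
def summarize(root, references):
    """Edges of an access graph for a sequence of (field, block) dereferences."""
    edges, previous = set(), root
    for field_name, block in references:
        node = (field_name, block, 0)     # label (field, block, instance 0)
        edges.add((previous, node))       # same label means same node, so edges coincide
        previous = node
    return edges

R1, R2 = ("r", 1, 0), ("r", 2, 0)

# Field r dereferenced by the same statement (block 1) in every loop iteration.
loop_graph = summarize("x", [("r", 1)] * 5)

# Field r dereferenced once in block 1 and once in block 2.
straight_graph = summarize("x", [("r", 1), ("r", 2)])
```

`loop_graph` has exactly two edges, x to `R1` and the self loop on `R1`, no matter how many
iterations are unrolled, whereas `straight_graph` keeps `R1` and `R2` apart.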

2.4.3 Operations on Access Graphs. Section 2.3 defined liveness by applying certain 
operations on access paths. In this subsection we define the corresponding operations on 
access graphs. Unless specified otherwise, the binary operations are applied only to access 
graphs having same root variable. The auxiliary operations and associated notations are: 

— Root($\rho$) denotes the root variable of access path $\rho$, while Root(G) denotes the root
variable of access graph G.

— Field(n) for a node n denotes the field name component of the label of n.

— G($\rho$) constructs the access graph corresponding to $\rho$. It uses the current basic block number
and the field names to create appropriate labels for nodes. The instance number depends
on the number of occurrences of a field name in the block. G($\rho$->*) creates an access
graph with root variable x and the summary node $n_*$, with an edge from x to $n_*$ and a self
loop over $n_*$.

— lastNode(G) returns the last node of a linear graph G constructed from a given $\rho$.

— CleanUp(G) deletes the nodes which are not reachable from the entry node.
Program: block 1 is x = x.l and block 2 is y = x.r.d. The drawings of the access graphs
$g_1, \ldots, g_6$ and of the remainder graphs $rg_1$, $rg_2$ are not reproduced.

Union:          $g_3 \uplus g_4 = g_4$,  $g_2 \uplus g_4 = g_5$,  $g_5 \uplus g_4 = g_5$,  $g_5 \uplus g_6 = g_6$
Path Removal:   $g_6 \ominus$ x->l $= g_2$,  $g_5 \ominus$ x $= \mathcal{E}_G$,  $g_4 \ominus$ x->r $= g_4$,  $g_4 \ominus$ x->l $= g_1$
Factorisation:  $g_2/(g_1,\{x\}) = \{rg_1\}$,  $g_5/(g_1,\{x\}) = \{rg_1, rg_2\}$,  $g_5/(g_2,\{r_2\}) = \{\mathcal{E}_{RG}\}$,  $g_4/(g_2,\{r_2\}) = \emptyset$
Extension:      $(g_3,\{l_1\}) \# \{rg_1\} = g_4$,  $(g_3,\{x,l_1\}) \# \{rg_1, rg_2\} = g_6$,  $(g_2,\{r_2\}) \# \{\mathcal{E}_{RG}\} = g_2$,  $(g_2,\{r_2\}) \# \emptyset = \mathcal{E}_G$

Fig. 4. Examples of operations on access graphs.



— CN(G,G',S) computes the set of nodes of G which correspond to the nodes of G' specified
in the set S. To compute CN(G,G',S), we define ACN(G,G'), the set of pairs of
all corresponding nodes. Let $G = (n_0, N, E)$ and $G' = (n'_0, N', E')$. A node $n$ in G corresponds
to a node $n'$ in G' if there exists an access path $\rho$ which is represented by a
path from $n_0$ to $n$ in G and a path from $n'_0$ to $n'$ in G'.
Formally, ACN(G, G') is the least solution of the following equation:

$$ \mathit{ACN}(G,G') = \begin{cases} \emptyset & \mathit{Root}(G) \neq \mathit{Root}(G') \\ \{(n_0, n'_0)\} \cup \{(n_j, n'_j) \mid \mathit{Field}(n_j) = \mathit{Field}(n'_j),\ n_i \to n_j \in E,\ n'_i \to n'_j \in E',\ (n_i, n'_i) \in \mathit{ACN}(G,G')\} & \text{otherwise} \end{cases} $$

$$ \mathit{CN}(G,G',S) = \{\, n \mid (n, n') \in \mathit{ACN}(G,G'),\ n' \in S \,\} $$

Note that $\mathit{Field}(n_j) = \mathit{Field}(n'_j)$ would hold even when $n_j$ or $n'_j$ is the summary node $n_*$.

Let $G = (n_0, N, E)$ and $G' = (n_0, N', E')$ be access graphs (having the same entry node).
G and G' are equal if $N = N'$ and $E = E'$.
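
The least solution of the ACN equation can be computed by a standard worklist iteration. The
sketch below is illustrative only and rests on several assumptions: an access graph is a
pair (entry label, set of edges), non-entry labels are tuples (field, block, instance), the
summary node is encoded by the hypothetical label `SUMMARY`, and the function names `acn`
and `cn` are chosen for this example.

```python
SUMMARY = ("*", 0, 0)            # assumed encoding of the summary node n_*

def successors(graph, node):
    return {dst for (src, dst) in graph[1] if src == node}

def fields_match(n, m):
    """Field names agree; the summary node matches any field name."""
    return n == SUMMARY or m == SUMMARY or n[0] == m[0]

def acn(g, h):
    """All pairs of corresponding nodes of access graphs g and h."""
    if g[0] != h[0]:                       # different root variables
        return set()
    pairs, work = {(g[0], h[0])}, [(g[0], h[0])]
    while work:                            # propagate correspondence along edges
        ni, mi = work.pop()
        for nj in successors(g, ni):
            for mj in successors(h, mi):
                if fields_match(nj, mj) and (nj, mj) not in pairs:
                    pairs.add((nj, mj))
                    work.append((nj, mj))
    return pairs

def cn(g, h, targets):
    """Nodes of g that correspond to the nodes `targets` of h."""
    return {n for (n, m) in acn(g, h) if m in targets}

L1, R2 = ("l", 1, 0), ("r", 2, 0)
g_lr = ("x", {("x", L1), (L1, R2)})                 # x -> l_1 -> r_2
g_r = ("x", {("x", R2)})                            # x -> r_2
g_both = ("x", {("x", L1), (L1, R2), ("x", R2)})    # both shapes together
g_any = ("x", {("x", SUMMARY), (SUMMARY, SUMMARY)}) # x -> n_* with a self loop
```

In `g_both`, the node `R2` corresponds to `R2` of `g_r` because x->r is represented in both
graphs; in `g_lr` it does not, since `R2` is reached there only through x->l->r.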

The main operations of interest are defined below and are illustrated in Figure 4. 

(1) Union ($\uplus$). $G \uplus G'$ combines access graphs G and G' such that any access path
contained in G or G' is contained in the resulting graph.

$$ G \uplus G' = (n_0,\ N \cup N',\ E \cup E') $$

The operation $N \cup N'$ treats the nodes with the same label as identical. Because of
associativity, $\uplus$ can be generalized to an arbitrary number of arguments in an obvious
manner.

(2) Path Removal ($\ominus$). The operation $G \ominus \rho$ removes those access paths in G which have
$\rho$ as a prefix.

$$ G \ominus \rho = \begin{cases} G & \rho = \mathcal{E} \text{ or } \mathit{Root}(\rho) \neq \mathit{Root}(G) \\ \mathcal{E}_G & \rho \text{ is a simple access path} \\ \mathit{CleanUp}((n_0, N, E - E_{del})) & \text{otherwise} \end{cases} $$

where

$$ E_{del} = \{\, n_i \to n_j \mid n_i \to n_j \in E,\ n_i \in \mathit{CN}(G, G^B, \{\mathit{lastNode}(G^B)\}),\ \mathit{Field}(n_j) = \mathit{Frontier}(\rho),\ G^B = G(\mathit{Base}(\rho)),\ \mathit{UniqueAccessPath?}(G, n_i) \,\} $$

UniqueAccessPath?(G, n) returns true if in G, all paths from the entry node to node
n represent the same access path. Note that path removal is conservative in that some
paths having $\rho$ as prefix may not be removed. Since an access graph edge may be
contained in more than one access path, we have to ensure that access paths which do
not have $\rho$ as prefix are not erroneously deleted.

(3) Factorization ($/$). Recall that the Transfer term in Definition 2.1 requires extracting
suffixes of access paths and attaching them to some other access paths. The corresponding
operations on access graphs are performed using factorization and extension.
Given a node $m \in (N - \{n_0\})$ of an access graph $G$, the Remainder Graph of $G$ at $m$
is the subgraph of $G$ rooted at $m$ and is denoted by RG($G, m$). If $m$ does not have any
outgoing edges, then the result is the empty remainder graph $\mathcal{E}_{RG}$. Let $M$ be a subset
of the nodes of $G'$ and $M'$ be the set of corresponding nodes in $G$. Then, $G/(G',M)$
computes the set of remainder graphs of the successors of nodes in $M'$.

\[
G/(G',M) = \{ \mathit{RG}(G, n_j) \mid n_i \rightarrow n_j \in E,\; n_i \in \mathit{CN}(G,G',M) \} \qquad (1)
\]

A remainder graph is similar to an access graph except that (a) its entry node does 
not correspond to a root variable but to a field name and (b) the entry node can have 
incoming edges. 

(4) Extension. Extending an empty access graph $\mathcal{E}_G$ results in the empty access graph $\mathcal{E}_G$.
For non-empty graphs, this operation is defined as follows.

(a) Extension with a remainder graph ($\cdot$). Let $M$ be a subset of the nodes of $G$ and
$R = \langle n', N^R, E^R \rangle$ be a remainder graph. Then, $(G,M) \cdot R$ appends the suffixes in
$R$ to the access paths ending on nodes in $M$.

\[ (G,M) \cdot \mathcal{E}_{RG} = G \]

\[ (G,M) \cdot R = \langle n_0,\; N \cup N^R,\; E \cup E^R \cup \{ n \rightarrow n' \mid n \in M \} \rangle \qquad (2) \]

(b) Extension with a set of remainder graphs ($\#$). Let $S$ be a set of remainder graphs.
Then, $(G,M) \# S$ extends access graph $G$ with every remainder graph in $S$.

\[ (G,M) \# \emptyset = \mathcal{E}_G \]

\[ (G,M) \# S = \biguplus_{R \in S} (G,M) \cdot R \qquad (3) \]
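
Union and extension reduce to set operations once nodes with the same label share an id. The sketch below is only an illustration under that assumption: the `Graph` class, the `"field@statement"` node ids, and the sentinels `EMPTY_G` and `EMPTY_RG` are hypothetical names, and final/intermediate node marking is ignored.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Graph:
    entry: str        # entry node id (root variable, or a field node for a remainder graph)
    nodes: frozenset  # node ids; equal ids denote the same label
    edges: frozenset  # (source id, target id) pairs

EMPTY_G = None        # stands for the empty access graph
EMPTY_RG = object()   # stands for the empty remainder graph

def union(g, h):
    """Union of two access graphs with the same entry node."""
    if g is EMPTY_G:
        return h
    if h is EMPTY_G:
        return g
    assert g.entry == h.entry, "union needs a common entry node"
    return Graph(g.entry, g.nodes | h.nodes, g.edges | h.edges)

def extend(g, m, r):
    """Extension (G, M) . R: attach remainder graph r below every node in m."""
    if g is EMPTY_G:
        return EMPTY_G
    if r is EMPTY_RG:
        return g
    links = frozenset((n, r.entry) for n in m)
    return Graph(g.entry, g.nodes | r.nodes, g.edges | r.edges | links)

def extend_all(g, m, rs):
    """Extension (G, M) # S: union of the extensions with every r in rs."""
    result = EMPTY_G
    for r in rs:
        result = union(result, extend(g, m, r))
    return result

# x->lptr extended at its last node with the remainder graph for the suffix rptr.
g = Graph("x", frozenset({"x", "lptr@1"}), frozenset({("x", "lptr@1")}))
r = Graph("rptr@2", frozenset({"rptr@2"}), frozenset())
e = extend_all(g, {"lptr@1"}, [r])
print(sorted(e.edges))
```

Extending with an empty set of remainder graphs yields the empty access graph, which mirrors the first case of equation (3).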

2.4.4 Safety of Access Graph Operations. Since access graphs are not exact represen- 
tations of sets of access paths, the safety of approximations needs to be defined explicitly. 
The constraints defined in Figure 5 capture safety in the context of liveness in the following 
sense: Every access path which can possibly be live should be retained by each operation. 
Since the complement of liveness is used for nullification, this ensures that no live access 
path is considered for nullification. These properties have been proved [Iyer 2005] using 
the PVS theorem prover 2 . 



2 Available from http://pvs.csl.sri.com.

Operation       Access Graphs             Access Paths
--------------  ------------------------  ---------------------------------------------------------------
Union           $G_3 = G_1 \uplus G_2$    $P(G_3,M_3) \supseteq P(G_1,M_1) \cup P(G_2,M_2)$
Path Removal    $G_2 = G_1 \ominus \rho$  $P(G_2,M_2) \supseteq P(G_1,M_1) - \{\rho \rightarrow \sigma \mid \rho \rightarrow \sigma \in P(G_1,M_1)\}$
Factorization   $S = G_1/(G_2,M)$         $P(S,M_S) = \{\sigma \mid \rho' \rightarrow \sigma \in P(G_1,M_1),\; \rho' \in P(G_2,M)\}$
Extension       $G_2 = (G_1,M) \# S$      $P(G_2,M_2) \supseteq P(G_1,M_1) \cup \{\rho \rightarrow \sigma \mid \rho \in P(G_1,M),\; \sigma \in P(S,M_S)\}$

Fig. 5. Safety of Access Graph Operations. $P(G,M)$ is the set of paths in graph $G$ terminating on nodes in $M$.
For graph $G_i$, $M_i$ is the set of all nodes in $G_i$. $S$ is the set of remainder graphs and $P(S,M_S)$ is the set of all paths
in all remainder graphs in $S$.



2.5 Data Flow Analysis for Discovering Explicit Liveness 

For a given root variable $v$, ELIn$_v(i)$ and ELOut$_v(i)$ denote the access graphs representing
explicitly live access paths at the entry and exit of basic block $i$. We use $\mathcal{E}_G$ as the initial
value for ELIn$_v(i)$/ELOut$_v(i)$.

\[
\mathit{ELIn}_v(i) = (\mathit{ELOut}_v(i) \ominus \mathit{ELKillPath}_v(i)) \uplus \mathit{ELGen}_v(i) \qquad (4)
\]

\[
\mathit{ELOut}_v(i) =
\begin{cases}
G(v \rightarrow *) & i = \mathit{Exit},\; v \in \mathit{Globals} \\
\mathcal{E}_G & i = \mathit{Exit},\; v \notin \mathit{Globals} \\
\biguplus_{s \in succ(i)} \mathit{ELIn}_v(s) & \text{otherwise}
\end{cases}
\qquad (5)
\]

where

\[
\mathit{ELGen}_v(i) = \mathit{LDirect}_v(i) \uplus \mathit{LTransfer}_v(i)
\]

We define ELKillPath$_v(i)$, LDirect$_v(i)$, and LTransfer$_v(i)$ depending upon the statement.

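(1) Assignment statement $\alpha_x = \alpha_y$. Apart from defining the desired terms for $x$ and $y$, we
also need to define them for any other variable $z$. In the following equations, $G_x$ and
$G_y$ denote $G(\rho_x)$ and $G(\rho_y)$ respectively, whereas $M_x$ and $M_y$ denote lastNode($G(\rho_x)$)
and lastNode($G(\rho_y)$) respectively.

\[
\mathit{LDirect}_x(i) = G(\mathit{Base}(\rho_x))
\]
\[
\mathit{LDirect}_y(i) =
\begin{cases}
\mathcal{E}_G & \alpha_y \text{ is } \mathit{New} \ldots \text{ or } \mathit{null} \\
G(\mathit{Base}(\rho_y)) & \text{otherwise}
\end{cases}
\]
\[
\mathit{LDirect}_z(i) = \mathcal{E}_G, \text{ for any variable } z \text{ other than } x \text{ and } y
\]
\[
\mathit{LTransfer}_y(i) =
\begin{cases}
\mathcal{E}_G & \alpha_y \text{ is } \mathit{New} \text{ or } \mathit{null} \\
(G_y,M_y) \# (\mathit{ELOut}_x(i)/(G_x,M_x)) & \text{otherwise}
\end{cases}
\qquad (6)
\]
\[
\mathit{LTransfer}_z(i) = \mathcal{E}_G, \text{ for any variable } z \text{ other than } y
\]
\[
\mathit{ELKillPath}_x(i) = \rho_x
\]
\[
\mathit{ELKillPath}_z(i) = \mathcal{E}, \text{ for any variable } z \text{ other than } x
\]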

As stated earlier, the path removal operation deletes an edge only if it is contained in
a unique path. Thus fewer paths may be killed than desired. This is a safe approximation.
Another approximation which is also safe is that only the paths rooted at $x$ are
killed. Since assignment to $\alpha_x$ changes the link represented by Frontier($\rho_x$), for
precision, any path which is guaranteed to contain the link represented by Frontier($\rho_x$)
should also be killed. Such paths can be discovered through must-alias analysis which
we do not perform.

Fig. 6. Explicit liveness for the program in Figure 1 under the assumption that all variables are local variables.

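(2) Function call $\alpha_x = f(\alpha_y)$. We conservatively assume that a function call may make
any access path rooted at $y$ or any global reference variable live. Thus this version of
our analysis is context insensitive.

\[
\mathit{LDirect}_x(i) = G(\mathit{Base}(\rho_x))
\]
\[
\mathit{LDirect}_y(i) = G(\rho_y \rightarrow *)
\]
\[
\mathit{LDirect}_z(i) =
\begin{cases}
G(z \rightarrow *) & \text{if } z \text{ is a global variable} \\
\mathcal{E}_G & \text{otherwise}
\end{cases}
\]
\[
\mathit{LTransfer}_z(i) = \mathcal{E}_G, \text{ for all variables } z
\]
\[
\mathit{ELKillPath}_x(i) = \rho_x
\]
\[
\mathit{ELKillPath}_z(i) = \mathcal{E}, \text{ for any variable } z \text{ other than } x
\]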
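(3) Return Statement return $\alpha_x$.

\[
\mathit{LDirect}_x(i) = G(\rho_x \rightarrow *)
\]
\[
\mathit{LDirect}_z(i) =
\begin{cases}
G(z \rightarrow *) & \text{if } z \text{ is a global variable} \\
\mathcal{E}_G & \text{otherwise}
\end{cases}
\]
\[
\mathit{LTransfer}_z(i) = \mathcal{E}_G, \text{ for any variable } z
\]
\[
\mathit{ELKillPath}_z(i) = \mathcal{E}, \text{ for any variable } z
\]

[Figure 7: control flow graph; the statements recoverable from the figure include x = y, x.n = New, and x.n.n = New.]

Fig. 7. Explicit liveness information is not sufficient for nullification.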



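(4) Use Statements

\[
\mathit{LDirect}_x(i) = \biguplus G(\rho_x) \text{ for every } \alpha_x.d \text{ used in } i
\]
\[
\mathit{LDirect}_z(i) = \mathcal{E}_G, \text{ for any variable } z \text{ other than } x \text{ and } y
\]
\[
\mathit{LTransfer}_z(i) = \mathcal{E}_G, \text{ for every variable } z
\]
\[
\mathit{ELKillPath}_z(i) = \mathcal{E}, \text{ for every variable } z
\]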

Example 2.5. Figure 6 lists explicit liveness information at different points of the 
program in Figure 1 under the assumption that all variables are local variables. □ 

Observe that computing liveness using equations (4) and (5) results in an MFP (Max- 
imum Fixed Point) solution of data flow analysis whereas definition (2.1) specifies an 
MoP (Meet over Paths) solution of data flow analysis. Since the flow functions are non- 
distributive (see appendix A), the two solutions may be different. 



3. OTHER ANALYSES FOR INSERTING NULL ASSIGNMENTS 

Explicit liveness alone is not enough to decide whether an assignment $\alpha_x$ = null can be
safely inserted at $p$. We have to additionally ensure that:

— Frontier($\rho_x$) is not live through an alias created before the program point $p$. The
extensions required to find all live access paths, including those created due to aliases, are
discussed in section 3.1.

— Dereferencing links during the execution of the inserted statement $\alpha_x$ = null does not
cause an exception. This is done through availability and anticipability analysis and is
described in section 3.2.

Both these requirements are illustrated through the example shown below:

Example 3.1. In Figure 7, access path y→n is not explicitly live in block 6. However,
Frontier(y→n) and Frontier(x→n) represent the same link due to the assignment x = y.
Thus y→n is implicitly live and setting it to null in block 6 will raise an exception in block
7. Also, x→n→n is not live in block 2. However, it cannot be set to null since the object
pointed to by x→n does not exist in memory when the execution reaches block 2. Therefore,
insertion of x.n.n = null in block 2 will raise an exception at run-time. □

3.1 Computing Live Access Paths

Recall that an access path is live if it is either explicitly live or shares its Frontier with
some explicitly live path. The property of sharing is captured by aliasing. Two access
paths $\rho_x$ and $\rho_y$ are aliased at a program point $p$ if Target($\rho_x$) is the same as Target($\rho_y$) at $p$
during some execution of the program. They are link-aliased if their frontiers represent the
same link; they are node-aliased if they are aliased but their frontiers do not represent the
same link. Link-aliases can be derived from node-aliases (or other link-aliases) by adding
the same field names to aliased access paths.

Alias information is flow-sensitive if the aliases at a program point depend on the state- 
ments along control flow paths reaching the point. Otherwise it is flow insensitive. Among 
flow sensitive aliases, two access paths are must-aliased at p if they are aliased along every 
control flow path reaching p; they are may-aliased if they are aliased along some control 
flow path reaching p. As an example, in Figure 1, x->lptr and y are must-node-aliases, 
x->lptr->lptr and y->lptr are must-link-aliases, and w and x are node-aliases at line 5. 

We compute flow sensitive may-aliases (without kills) using the algorithm described
by Hind et al. [1999] and use pairs of access graphs for compact representation of aliases.
Liveness is computed through a backward propagation much in the same manner as explicit
liveness except that we ensure that the set of live paths at each program point is closed under
may-aliasing. This requires the following two changes in the earlier scheme.

(1) Inclusion of Intermediate Nodes in Access Graphs. Unlike explicit liveness, live access
paths may not be prefix closed. This is because the frontier of a live access path $\rho_x$ may
be accessed using some other access path and not through the links which constitute
$\rho_x$. Hence prefixes of $\rho_x$ may not be live. In an access graph representing liveness,
all paths may not represent live links. We therefore modify the access graph so that
such paths are not described by the access graph. In order to make this distinction, we
divide the nodes in an access graph into two categories: final and intermediate. The
only access paths described by the access graph are those which end at final nodes.$^3$
This change affects the access graph operations in the following manner:
— The equality of graphs now must consider equality of the sets of intermediate nodes
and the sets of final nodes separately.
— Graph constructor $G(\rho_x)$ marks all nodes in the resulting graph as final implying
that all non-empty prefixes of $\rho_x$ are contained in the graph. We define a new
constructor GOnly($\rho_x$) which marks only the last node as final and all other nodes as
intermediate implying that only $\rho_x$ is contained in the graph.
— Whenever multiple nodes with identical labels are combined, if any instance of the
node is final then the resulting node is treated as final. This influences union ($\uplus$)
and extension ($\#$).
— The set $M$ used in defining factorization and extension (equations 1, 2, 3) and the
safety properties of access graph operations (Figure 5) contains final nodes only.
— Extension $G \cdot RG$ marks all nodes in $G$ as intermediate. If $G$ and $RG$ have a common
node then the status of the node is governed by its status in $RG$.
— The CleanUp($G$) operation is modified to delete those intermediate nodes which do
not have a path leading to a final node.

3 These two categories are completely orthogonal to the labeling criterion of the nodes.

Fig. 8. Alias pairs for the running example from Figure 1.

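(2) Link Alias Closure. To discover all link aliases of a live access path we compute link alias
closure as defined below. Given an alias set $AS$, the set of link aliases of an access
path $\rho_x \rightarrow f$ is the least solution of:

\[
\mathit{LnA}(\rho_x \rightarrow f, AS) = \{ \rho_y \rightarrow f \mid \langle \rho_x,\rho_y \rangle \in AS \text{ or } \rho_y \in \mathit{LnA}(\rho_x, AS) \}
\]

Given an alias pair $\langle g_x,g_y \rangle$, the link aliases rooted at $y$ are included in the access
graph $G_y$ as follows:

\[
\mathit{LnG}(G_y, G_x, \langle g_x,g_y \rangle) = G_y \uplus (g_y,m_y) \# ((G_x/(g_x,m_x)) - \{\mathcal{E}_{RG}\}) \qquad (7)
\]

where $m_y$ and $m_x$ are the singleton sets containing the final nodes of $g_y$ and $g_x$
respectively. $\mathcal{E}_{RG}$ has to be removed from the set of remainder graphs because we want to
transfer non-empty links only. Complete liveness is computed as the least solution of
the following equations:

\[
\mathit{LIn}_v(i) = \mathit{TLIn}_v(i) \uplus \biguplus_{\langle g_v,g_u \rangle \in AIn(i)} \mathit{LnG}(\mathit{LIn}_v(i), \mathit{LIn}_u(i), \langle g_v,g_u \rangle)
\]

\[
\mathit{LOut}_v(i) =
\begin{cases}
G(v \rightarrow *) & i = \mathit{Exit},\; v \in \mathit{Globals} \text{ or } v \in \mathit{Params} \\
\mathcal{E}_G \uplus \biguplus_{\langle g_v,g_u \rangle \in AOut(i)} \mathit{LnG}(\mathit{LOut}_v(i), \mathit{LOut}_u(i), \langle g_v,g_u \rangle) & i = \mathit{Exit},\; v \notin \mathit{Globals},\; v \notin \mathit{Params} \\
\biguplus_{s \in succ(i)} \mathit{LIn}_v(s) & \text{otherwise}
\end{cases}
\]

where TLIn$_v(i)$ is the same as ELIn$_v(i)$ except that ELOut$_v(i)$ is replaced by LOut$_v(i)$ in
the main equation (equation 4) and in the computation of Transfer (equation 6).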
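
The recursive definition of LnA can be traced on explicit access paths. The sketch below is an illustration only: it represents access paths as tuples of names and treats the alias set as a symmetric relation, both of which are assumptions of this example, whereas the analysis itself works on access graphs through LnG.

```python
def link_aliases(path, alias_pairs):
    """Link aliases of `path`, e.g. ("y", "lptr"), under the given alias pairs.

    A path base->f is link-aliased to other->f whenever `other` is aliased
    to `base` directly, or is itself a link alias of `base`.
    """
    if len(path) < 2:
        return set()  # a root variable alone has no field link to share
    base, field = path[:-1], path[-1]
    # Assumption of this sketch: aliasing is symmetric.
    symmetric = set(alias_pairs) | {(b, a) for (a, b) in alias_pairs}
    direct = {b for (a, b) in symmetric if a == base}
    bases = direct | link_aliases(base, alias_pairs)
    return {b + (field,) for b in bases}

# x->lptr and y are node-aliases, as in the running example.
aliases = {(("x", "lptr"), ("y",))}
print(link_aliases(("y", "lptr"), aliases))
print(link_aliases(("y", "lptr", "rptr"), aliases))
```

The recursion is on a strictly shorter path at every step, so it terminates for any finite access path.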

Example 3.2. Figure 8 shows the may-alias information for our running example
from Figure 1. Observe that the access graphs used for storing alias information have
only the last node as final and all other nodes as intermediate. Figure 9 shows the liveness
access graphs augmented with implicit liveness. □

Fig. 9. Liveness access graphs including implicit liveness information for the program in Figure 1. Gray nodes
are nodes included by link-alias computation. Intermediate nodes are shown with dotted lines.

Observe that in the presence of cyclic data structures, we will get alias pairs of the
form $\langle \rho, \rho \rightarrow \sigma \rangle$. If a link in the cycle is live then the link alias closure will ensure that all
possible links are marked live by creating cycles in the access graphs. This may cause
approximation but would be safe.

3.2 Availability and Anticipability of Access Paths 

Example 3.1 shows that safety of inserting an assignment $\alpha_x$ = null at a program point $p$
requires that whenever control reaches $p$, every prefix of Base($\rho_x$) has a non-null l-value.
Such an access path is said to be accessible at $p$. Our use of accessibility ensures the
preservation of semantics in the following sense: Consider an execution path which does
not have a dereferencing exception in the unoptimized program. Then the optimized
program will also not have any dereferencing exception along the same execution path.

Fig. 10. Flow functions for availability. AIn($s$) denotes the set of may-aliases at the entry of $s$.

3.2.1 Defining Availability and Anticipability. We define an access path to be
accessible at $p$ if all of its prefixes are available or anticipable at $p$:

— An access path $\rho_x$ is available at a program point $p$, if along every path reaching $p$,
there exists a program point $p'$ such that Frontier($\rho_x$) is either dereferenced or assigned
a non-null l-value at $p'$ and is not made null between $p'$ and $p$.

— An access path $\rho_x$ is anticipable at a program point $p$, if along every path starting from
$p$, Frontier($\rho_x$) is dereferenced before being assigned.

Since both these properties are all-paths properties, all may-link aliases of the left hand
side of an assignment need to be killed. Conversely, these properties can be made more
precise by including must-aliases in the set of anticipable or available paths.

Recall that comparisons in conditionals consist of simple variables only. The use of
these variables does not involve any dereferencing. Hence a comparison x == y does not
contribute to accessibility of x or y.

Definition 3.1. Availability. The set of paths which are available at a program point $p$,
denoted by Avail$_p$, is defined as follows.

\[
\mathit{Avail}_p = \bigcap_{\psi \in \mathit{Paths}(p)} \mathit{PathAvail}_p^{\psi}
\]

where $\psi \in$ Paths($p$) is a control flow path from Entry to $p$ and PathAvail$_p^{\psi}$ denotes the
availability at $p$ along $\psi$ and is defined as follows. If $p$ is not Entry of the procedure being
analyzed, then let the statement which precedes it be denoted by $s$ and the program point
immediately preceding $s$ be denoted by $p'$. Then,

\[
\mathit{PathAvail}_p^{\psi} =
\begin{cases}
\emptyset & p \text{ is } \mathit{Entry} \\
\mathit{StatementAvail}_s(\mathit{PathAvail}_{p'}^{\psi}) & \text{otherwise}
\end{cases}
\]

where the flow function for $s$ is defined as follows:

\[
\mathit{StatementAvail}_s(X) = (X - \mathit{AvKill}_s) \cup \mathit{AvDirect}_s \cup \mathit{AvTransfer}_s(X)
\]

AvKill$_s$ denotes the set of access paths which cease to be available after statement $s$,
AvDirect$_s$ denotes the set of access paths which become available due to local effect of
$s$ and AvTransfer$_s(X)$ denotes the set of access paths which become available after $s$
due to transfer. They are defined in Figure 10. □

Fig. 11. Flow functions for anticipability. AOut($s$) denotes the set of may-aliases at the exit of $s$.

In a similar manner, we define anticipability of access paths. 

Definition 3.2. Anticipability. The set of paths which are anticipable at a program
point $p$, denoted by Ant$_p$, is defined as follows.

\[
\mathit{Ant}_p = \bigcap_{\psi \in \mathit{Paths}(p)} \mathit{PathAnt}_p^{\psi}
\]

where $\psi \in$ Paths($p$) is a control flow path from $p$ to Exit and PathAnt$_p^{\psi}$ denotes the
anticipability at $p$ along $\psi$ and is defined as follows. If $p$ is not Exit then let the statement
which follows it be denoted by $s$ and the program point immediately following $s$ be denoted by
$p'$. Then,

\[
\mathit{PathAnt}_p^{\psi} =
\begin{cases}
\emptyset & p \text{ is } \mathit{Exit} \\
\mathit{StatementAnt}_s(\mathit{PathAnt}_{p'}^{\psi}) & \text{otherwise}
\end{cases}
\]

where the flow function for $s$ is defined as follows:

\[
\mathit{StatementAnt}_s(X) = (X - \mathit{AnKill}_s) \cup \mathit{AnDirect}_s \cup \mathit{AnTransfer}_s(X)
\]

AnKill$_s$ denotes the set of access paths which cease to be anticipable before statement $s$,
AnDirect$_s$ denotes the set of access paths which become anticipable due to local effect of $s$
and AnTransfer$_s(X)$ denotes the set of access paths which become anticipable before $s$
due to transfer. They are defined in Figure 11. □

Observe that both Avail p and Ant p are prefix-closed. 

3.2.2 Data Flow Analyses for Availability and Anticipability. Availability and
Anticipability are all (control-flow) paths properties in that the desired property must hold along
every path reaching/leaving the program point under consideration. Thus these analyses
identify access paths which are common to all control flow paths including acyclic control
flow paths. Since acyclic control flow paths can generate only acyclic$^4$ and hence finite
access paths, anticipability and availability analyses deal with a finite number of access
paths and summarization is not required.

4 In the presence of cycles in heap, considering only acyclic access paths results in an approximation which is safe
for availability and anticipability.

Fig. 12. Availability and anticipability for the program in Figure 1.

Thus there is no need to use access graphs for availability and anticipability analyses. 
The data flow analysis can be performed using a set of access paths because the access 
paths are bounded and the sets would be finite. Moreover, since the access paths resulting 
from anticipability and availability are prefix-closed, they can be represented efficiently. 

The data flow equations are the same as the definitions of these analyses except that the
definitions are path-based (i.e. they define the MoP solution) while the data flow equations are
edge-based (i.e. they define the MFP solution) as is customary in data flow analysis. In other
words, the data flow information is merged at the intermediate points and availability and
anticipability information is derived from the corresponding information at the preceding
and following program point respectively. As observed in appendix A, the flow functions in
availability and anticipability analyses are non-distributive; hence MoP and MFP solutions
may be different.

For brevity, we omit the data flow equations. We use the universal set of access paths 
as the initial value for all blocks other than Entry for availability analysis and Exit for 
anticipability analysis. 
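
Since the equations are omitted, the following sketch shows the general shape such an all-paths analysis could take for availability. It is an assumption-laden illustration, not the omitted equations: the Transfer term is dropped, access paths are plain strings, and the control flow graph and the `gen`/`kill` sets are made up for the example.

```python
def solve_availability(preds, gen, kill, entry, universe):
    """Round-robin iteration to a fixed point; confluence is intersection."""
    # Universal set as the initial value everywhere except at Entry.
    avail_in = {n: set(universe) for n in preds}
    avail_out = {n: set(universe) for n in preds}
    avail_in[entry] = set()
    changed = True
    while changed:
        changed = False
        for n in preds:
            if n == entry:
                new_in = set()
            else:
                new_in = set(universe)
                for p in preds[n]:
                    new_in &= avail_out[p]  # must hold along every incoming edge
            new_out = (new_in - kill[n]) | gen[n]
            if new_in != avail_in[n] or new_out != avail_out[n]:
                avail_in[n], avail_out[n] = new_in, new_out
                changed = True
    return avail_in, avail_out

# Hypothetical diamond-shaped flow graph: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4.
preds = {1: [], 2: [1], 3: [1], 4: [2, 3]}
gen = {1: {"x"}, 2: {"x.lptr"}, 3: set(), 4: set()}
kill = {n: set() for n in preds}
universe = {"x", "x.lptr", "x.rptr"}
avail_in, avail_out = solve_availability(preds, gen, kill, 1, universe)
print(avail_in[4])
```

Anticipability would follow the same pattern with successors in place of predecessors and Exit in place of Entry.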

Example 3.3. Figure 12 gives the availability and anticipability information for the
program in Figure 1. AvIn($i$) and AvOut($i$) denote the set of available access paths before
and after the statement $i$, while AnIn($i$) and AnOut($i$) denote the set of anticipable access
paths before and after the statement $i$. □

4. NULL ASSIGNMENT INSERTION 

We now explain how the analyses described in preceding sections can be used to insert 
appropriate null assignments to nullify dead links. The inserted assignments should be 
safe and profitable as defined below. 

Definition 4.1. Safety. It is safe to insert an assignment $\alpha$ = null at a program point $p$ if
and only if $\rho$ is not live at $p$ and Base($\rho$) can be dereferenced without raising an exception.

An access path $\rho$ is nullable at a program point $p$ if and only if it is safe to insert
assignment $\alpha$ = null at $p$.

Definition 4.2. Profitability. It is profitable to insert an assignment $\alpha$ = null at a
program point $p$ if and only if no proper prefix of $\rho$ is nullable at $p$ and the link corresponding
to Frontier($\rho$) is not made null before execution reaches $p$.

Note that the profitability definition is strict in that every control flow path may nullify a
particular link only once. Redundant null assignments on any path are prohibited. Since
control flow paths have common segments, a null assignment may be partially redundant
in the sense that it may be redundant along one path but not along some other path. Such
null assignments will be deemed unprofitable by Definition 4.2. Our algorithm may not be
able to avoid all redundant assignments.

Example 4.1. We illustrate some situations of safety and profitability for the program 
in Figure 1 . 

— Access path x->lptr->lptr is not nullable at the entry of 6. This is because x->lptr->lptr is 
implicitly live, due to the use of y->lptr in 6. Hence it is not safe to insert x.lptr.lptr = null 
at the entry of 6. 

— Access path x->rptr is nullable at the entry of 4, and continues to be so on the path from 
the entry of 4 to the entry of 7. The assignment x.rptr = null is profitable only at the 
entry of 4. □ 

Section 4.1 describes the criteria for deciding whether a given path $\rho$ should be
considered for a null assignment at a program point $p$. Section 4.2 describes how we create the
set of candidate access paths. Let Live($p$), Available($p$), and Anticipable($p$) denote the set of
live paths, the set of available paths and the set of anticipable paths respectively at program point
$p$.$^5$ They refer to LIn($i$), AvIn($i$), and AnIn($i$) respectively when $p$ is In$_i$. When $p$ is Out$_i$,
they refer to LOut($i$), AvOut($i$), and AnOut($i$) respectively.

4.1 Computing Safety and Profitability 

To find out if $\rho$ can be nullified at $p$, we compute two predicates: Nullable and Nullify.
Nullable($\rho, p$) captures the safety property: it is true if insertion of assignment $\alpha$ = null
at program point $p$ is safe.

\[
\mathit{Nullable}(\rho, p) = \rho \notin \mathit{Live}(p) \wedge \mathit{Base}(\rho) \in \mathit{Available}(p) \cup \mathit{Anticipable}(p) \qquad (8)
\]

Nullify($\rho, p$) captures the profitability property: it is true if insertion of assignment
$\alpha$ = null at program point $p$ is profitable. To compute Nullify, we note that it is most
profitable to set a link to null at the earliest point where it ceases to be live. Therefore,
the Nullify predicate at a point has to take into account the possibility of null assignment
insertion at previous point(s). For a statement $i$ in the program, let In$_i$ and Out$_i$ denote the
program points immediately before and after $i$. Then,

\[
\mathit{Nullify}(\rho, Out_i) = \mathit{Nullable}(\rho, Out_i)
  \wedge \Big( \bigwedge_{\rho' \in \mathit{ProperPrefix}(\rho)} \rho' \in \mathit{Live}(Out_i) \Big)
  \wedge \big( \neg \mathit{Nullable}(\rho, In_i) \vee \neg \mathit{Transp}(\rho, i) \big) \qquad (9)
\]

\[
\mathit{Nullify}(\rho, In_i) = \mathit{Nullable}(\rho, In_i)
  \wedge \Big( \bigwedge_{\rho' \in \mathit{ProperPrefix}(\rho)} \rho' \in \mathit{Live}(In_i) \Big)
  \wedge \rho \neq \mathit{lhs}(i)
  \wedge \Big( \neg \bigwedge_{j \in pred(i)} \mathit{Nullable}(\rho, Out_j) \Big) \qquad (10)
\]

where Transp($\rho, i$) denotes that $\rho$ is transparent with respect to statement $i$, i.e. no prefix of
$\rho$ is may-link-aliased to the access path corresponding to the lhs of statement $i$ at In$_i$. lhs($i$)
denotes the access path corresponding to the lhs access expression of the assignment in
statement $i$. pred($i$) is the set of predecessors of statement $i$ in the program. ProperPrefix($\rho$) is
the set of all proper prefixes of $\rho$.

5 Because availability and anticipability properties are prefix closed, Base($\rho$) $\in$ Available($p$) $\cup$ Anticipable($p$)
guarantees that all proper prefixes of $\rho$ are either available or anticipable.

We insert assignment $\alpha$ = null at program point $p$ if Nullify($\rho, p$) is true.
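
Equations (8) and (9) translate almost directly into set membership tests. The sketch below is a hedged illustration: access paths are tuples of names, the liveness, availability and transparency inputs are assumed to be supplied by the earlier analyses, and treating a root variable (empty base) as always dereferenceable is an assumption of this example.

```python
def proper_prefixes(path):
    """All proper, non-empty prefixes of an access path given as a tuple."""
    return [path[:k] for k in range(1, len(path))]

def nullable(path, live, available, anticipable):
    """Equation (8): the path is not live and its base is accessible."""
    base = path[:-1]
    # Assumption: a root variable has an empty base that needs no dereference.
    base_ok = base == () or base in (available | anticipable)
    return path not in live and base_ok

def nullify_out(path, live_out, ok_out, ok_in, transparent):
    """Equation (9); ok_out and ok_in are Nullable at Out_i and In_i."""
    prefixes_live = all(p in live_out for p in proper_prefixes(path))
    return ok_out and prefixes_live and (not ok_in or not transparent)

# Hypothetical facts at some statement i: only x is live and available.
live = {("x",)}
available = {("x",)}
path = ("x", "rptr")
ok = nullable(path, live, available, set())
print(ok, nullify_out(path, live, ok, ok_in=False, transparent=True))
```

When the path was already nullable at In$_i$ and statement $i$ is transparent for it, `nullify_out` returns false, which is how the earliest insertion point is preferred.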

4.2 Computing Candidate Access Paths for null Insertion 

The method described above only checks whether a given access path $\rho$ can be nullified
at a given program point $p$. We can generate the candidate set of access paths for null
insertion at $p$ as follows: For any candidate access path $\rho$, Base($\rho$) must either be
available or anticipable at $p$. Additionally, all simple access paths are also candidates for null
insertions. Therefore,

\[
\mathit{Candidates}(p) = \{ \rho \rightarrow f \mid \rho \in \mathit{Available}(p) \cup \mathit{Anticipable}(p),\; f \in \mathit{OutField}(\rho) \}
  \cup \{ \rho \mid \rho \text{ is a simple access path} \} \qquad (11)
\]

where OutField($\rho$) is the set of fields which can be used to extend access path $\rho$ at $p$. It
can be obtained easily from the type information of the object Target($\rho$) at $p$.

Note that all the information required for equations (8), (9), (10), and (11) is obtained
from the result of data flow analyses described in preceding sections. Type information of
objects required by equation (11) can be obtained from the front end of the compiler. Transp
uses may-alias information as computed in terms of pairs of access graphs.

Example 4.2. Figure 13 lists a trace of the null insertion algorithm for the program
in Figure 1. □
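
Equation (11) can be read as a simple set comprehension. In the sketch below, `out_fields` is a hypothetical stand-in for the type information supplied by the compiler front end, and the variable and field names are assumptions made for the example.

```python
def candidates(available, anticipable, out_fields, variables):
    """Equation (11): extend every accessible path by one field, add all variables."""
    accessible = available | anticipable
    extended = {path + (f,) for path in accessible for f in out_fields(path)}
    simple = {(v,) for v in variables}
    return extended | simple

# Hypothetical type information: every object has fields lptr and rptr.
def out_fields(path):
    return ["lptr", "rptr"]

result = candidates({("x",)}, {("x", "lptr")}, out_fields, ["x", "y"])
print(sorted(result))
```

Each candidate would then be filtered through the Nullify predicate of Section 4.1 before an assignment is inserted.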

4.3 Reducing Redundant null Insertions 

Consider a program with an assignment statement $i : \alpha_x = \alpha_y$. Assume a situation where,
for some non-empty suffix $\sigma$, both Nullify($\rho_y \rightarrow \sigma$, In$_i$) and Nullify($\rho_x \rightarrow \sigma$, Out$_i$) are true. In
that case, we will be inserting $\alpha_y.\sigma$ = null at In$_i$ and $\alpha_x.\sigma$ = null at Out$_i$. Clearly, the
latter null assignment is redundant in this case and can be avoided by checking if $\rho_y \rightarrow \sigma$ is
nullable at In$_i$.

If must-alias analysis is performed then redundant assignments can be reduced further.
Since the must-link-alias relation is symmetric, reflexive, and transitive and hence an
equivalence relation, the set of candidate paths at a program point can be divided into equivalence
classes based on the must-link-alias relation. Redundant null assignments can be reduced by
nullifying at most one access path in any equivalence class.
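
One way to pick a single representative per equivalence class is a union-find pass over the candidate paths. This is only a sketch of that idea: must-alias analysis is not performed in the paper, so the `must_pairs` input and the choice of the lexicographically smallest path as representative are assumptions of this example.

```python
def one_per_class(paths, must_pairs):
    """Keep one access path from each must-link-alias equivalence class."""
    parent = {p: p for p in paths}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b in must_pairs:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    classes = {}
    for p in paths:
        classes.setdefault(find(p), []).append(p)
    # Arbitrary but deterministic choice: smallest path of each class.
    return sorted(min(members) for members in classes.values())

paths = [("x", "lptr", "lptr"), ("y", "lptr"), ("x", "rptr")]
must_pairs = [(("x", "lptr", "lptr"), ("y", "lptr"))]
print(one_per_class(paths, must_pairs))
```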

5. CONVERGENCE OF HEAP REFERENCE ANALYSIS 

The null assignment insertion algorithm makes a single traversal over the control flow
graph. We show the termination of liveness analysis using the properties of access graph
operations. Termination of availability and anticipability can be shown by similar
arguments over finite sets of bounded access paths. Termination of alias analysis follows from
Hind et al. [1999].

5.1 Monotonicity 

For a program there are a finite number of basic blocks, a finite number of fields for any
root variable, and a finite number of field names in any access expression. Hence the
number of access graphs for a program is finite. Further, the number of nodes, and hence
the size of each access graph, is bounded by the number of labels which can be created for
a program.

Fig. 13. Null insertion for the program in Figure 1.

Access graphs for a variable $x$ form a complete lattice with a partial order $\sqsubseteq_G$ induced by
$\uplus$. Note that $\uplus$ is commutative, idempotent, and associative. Let $G = \langle x,N_F,N_I,E \rangle$ and
$G' = \langle x,N'_F,N'_I,E' \rangle$ where subscripts $F$ and $I$ distinguish between the final and intermediate
nodes. The partial order $\sqsubseteq_G$ is defined as

\[
G \sqsubseteq_G G' \Leftrightarrow (N'_F \subseteq N_F) \wedge (N'_I \subseteq (N_F \cup N_I)) \wedge (E' \subseteq E)
\]

Clearly, $G \sqsubseteq_G G'$ implies that $G$ contains all access paths of $G'$. We extend $\sqsubseteq_G$ to a set of
access graphs as follows:

\[
S_1 \sqsubseteq_S S_2 \Leftrightarrow \forall G_2 \in S_2,\; \exists G_1 \in S_1 \text{ s.t. } G_1 \sqsubseteq_G G_2
\]

It is easy to verify that $\sqsubseteq_G$ is reflexive, transitive, and antisymmetric. For a given variable
$x$, the access graph $\mathcal{E}_G$ forms the $\top$ element of the lattice while the $\bot$ element is a greatest
lower bound of all access graphs.

The partial order over access graphs and their sets can be carried over unaltered to
remainder graphs ($\sqsubseteq_{RG}$) and their sets with the added condition that $\mathcal{E}_{RG}$ is
incomparable to any other non-empty remainder graph.
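
The relationship between the partial order and union can be checked on a small instance. The sketch below is illustrative only: it encodes an access graph as a tuple of root, final nodes, intermediate nodes and edges, and the node ids and the two example graphs (one built in the style of $G(\rho)$, the other in the style of GOnly($\rho$)) are assumptions.

```python
def leq(g, h):
    """g is below h: g contains every final node, node and edge of h."""
    root, nf, ni, e = g
    root2, nf2, ni2, e2 = h
    return root == root2 and nf2 <= nf and ni2 <= (nf | ni) and e2 <= e

def union(g, h):
    """Union where a node that is final in either graph stays final."""
    root, nf, ni, e = g
    root2, nf2, ni2, e2 = h
    assert root == root2
    final = nf | nf2
    return (root, final, (ni | ni2) - final, e | e2)

edges = frozenset({("x", "l1"), ("l1", "r2")})
# All prefixes of x->lptr->rptr are described (every node final).
g_all = ("x", frozenset({"x", "l1", "r2"}), frozenset(), edges)
# Only x->lptr->rptr itself is described (last node final).
g_only = ("x", frozenset({"r2"}), frozenset({"x", "l1"}), edges)

print(leq(g_all, g_only), leq(g_only, g_all))
print(union(g_all, g_only) == g_all)
```

In this instance the graph with more access paths is the lower one, and taking the union with a graph above it leaves it unchanged, which is the sense in which the order is induced by union.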

Access graph operations are monotonic as described in Figure 14. Path removal is
monotonic in the first argument but not in the second argument. Similarly factorization is
monotonic in the first argument but not in the second and the third argument. However, we show
that in each context where they are used, the resulting functions are monotonic:

(1) Path removal is used only for an assignment $\alpha_x = \alpha_y$. It is used in liveness analysis
and its second argument is $\rho_x$ which is constant for any assignment statement $\alpha_x = \alpha_y$.
Thus the resulting flow functions are monotonic.

(2) Factorization is used in the following situations:

(a) Link-alias closure of access graphs. From equation (7) it is clear that LnG is
monotonic in the first argument (because it is used in $\uplus$) and the second argument
(because it is supplied as the first argument of factorization). The third and the
fourth arguments of LnG are linear access graphs containing a single path and
hence are incomparable with any other linear access graph. Thus link-alias
computation is monotonic in all its arguments.

(b) Liveness analysis. Factorization is used for the flow function corresponding to
an assignment $\alpha_x = \alpha_y$ and its second argument is $G(\rho_x)$ while its third
argument is lastNode($G(\rho_x)$), both of which are constant for any assignment statement
$\alpha_x = \alpha_y$. Thus, the resulting flow functions are monotonic.

Thus we conclude that all flow functions are monotonic. Since lattices are finite, termina- 
tion of heap reference analysis follows. 

Appendix A discusses the distributivity of flow functions. 

5.2 Complexity 

This section discusses the issues which influence the complexity and efficiency of perform- 
ing heap reference analysis. Empirical measurements which corroborate the observations 
made in this section are presented in Section 7. 

The data flow frameworks defined in this paper are not separable [Khedker 2002] be- 
cause the data flow information of a variable depends on the data flow information of other 
variables. Thus the number of iterations over control flow graph is not bounded by the 
depth of the graph [Aho et al. 1986; Hecht 1977; Khedker 2002] but would also depend on 
the number of root variables which depend on each other. 

Although we consider each statement to be a basic block, our control flow graphs retain 
only statements involving references. A further reduction in the size of control flow graphs 
follows from the fact that successive use statements need not be kept separate and can be 
grouped together into a block which ends on a reference assignment. 

The amount of work done in each iteration is not fixed but depends on the size of access 
graphs. Of all operations performed in an iteration, only CFN(G, G') is costly. Conversion 
to deterministic access graphs is also a costly operation but is performed only in a single pass 
during null assignment insertion. In practice, the access graphs are quite small for 
the following reason: Recall that edges in access graphs capture the dependence of a reference 
made at one program point on some other reference made at another point (Section 2.4.1). 
In real programs, traversals involving long dependences are performed using iterative 
constructs in the program. In such situations, the length of the chain of dependences is limited 
by the process of summarization because summarization treats nodes with the same label 
as being identical. Thus, in real programs chains of such dependences, and hence the 
access graphs, are quite small in size. This is corroborated by Figure 16 which provides the 
empirical data for the access graphs in our examples. The average number of nodes in 
these access graphs is less than 7 while the average number of edges is less than 12. These 
numbers are still smaller in the interprocedural analysis. Hence the complexity of access 
graph operations is not a matter of concern. 
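The effect of summarization on graph size can be illustrated with a minimal sketch. It assumes, purely for illustration, that each dereference in an access path is encoded as a (field, program point) pair and that an access graph is a set of labeled nodes plus a set of edges. The function name `summarize`, the program point 5, and this encoding are hypothetical and are not taken from the implementations described in Section 7.

```python
def summarize(root, paths):
    """Fold access paths into one graph; equal labels share a node.

    Each path is a sequence of (field, program_point) labels (an assumed
    encoding). Returns the pair (nodes, edges).
    """
    nodes = {root}
    edges = set()
    for path in paths:
        prev = root
        for label in path:
            nodes.add(label)          # same label -> same node (summarization)
            edges.add((prev, label))  # this reference depends on the previous one
            prev = label
    return nodes, edges


# A loop that repeatedly dereferences field n at (hypothetical) program point 5
# yields access paths x->n, x->n->n, x->n->n->n, ... of growing length.
loop_paths = [[("n", 5)] * k for k in range(1, 51)]
nodes, edges = summarize("x", loop_paths)
```

In this sketch the fifty paths collapse into a graph with two nodes and two edges, one of which is a self-loop on the node for field n; the graph size does not grow with the length of the traversal.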

6. SAFETY OF NULL ASSIGNMENT INSERTION 

We have to prove that the null assignments inserted by our algorithm (Section 4) in a 
program are safe in that they do not alter the result of executing the program. We do this 
by showing that (a) an inserted statement itself does not raise a dereferencing exception, 
and (b) an inserted statement does not affect any other statement, whether original or inserted. 

We use the subscripts $b$ and $a$ for a program point $p$ to denote "before" and "after" in 
an execution order. Further, the corresponding program points in the original and modified 
program are distinguished by the superscripts $o$ and $m$. The correspondence is defined as 
follows: If $p^m$ is immediately before or after an inserted assignment $\alpha = \mathrm{null}$, $p^o$ is the 
point where the decision to insert the null assignment is taken. For any other $p^m$, there is 
an obvious $p^o$. 

We first assert the soundness of availability, anticipability and alias analyses without 
proving them. 

LEMMA 6.1. (Soundness of Availability Analysis). Let $AV_{p_a}$ be the set of access paths 
available at program point $p_a$. Let $\rho \in AV_{p_a}$. Then along every path reaching $p_a$, there 
exists a program point $p_b$, such that the link represented by Frontier($\rho$) is either dereferenced 
or assigned a non-null l-value at $p_b$ and is not made null between $p_b$ and $p_a$. 

LEMMA 6.2. (Soundness of Anticipability Analysis). Let $AN_p$ be the set of access 
paths anticipable at program point $p$. Let $\rho \in AN_p$. Then along every path starting from $p$, 
the link represented by Frontier($\rho$) is dereferenced before being assigned. 

For semantically valid input programs (i.e. programs which do not generate dereferencing 
exceptions), Lemma 6.1 and Lemma 6.2 guarantee that if $\rho$ is available or anticipable 
at $p$, Target($\rho$) can be dereferenced at $p$. 

LEMMA 6.3. (Soundness of Alias Analysis). Let Frontier($\rho_x$) represent the same link 
as Frontier($\rho_y$) at a program point $p$ during some execution of the program. Then 
link-alias computation of $\rho_x$ at $p$ would discover $\rho_y$ to be link-aliased to $\rho_x$. 

For the main claim, we relate the access paths at $p_a$ to the access paths at $p_b$ by 
incorporating the effect of intervening statements only, regardless of the statements executed 
before $p_b$. In some execution of a program, let $\rho$ be the access path of interest at $p_a$ and the 
sequence of statements between $p_b$ and $p_a$ be $s$.⁶ Then $T(s,\rho)$ represents the access path 
at $p_b$ which, if non-$\mathcal{E}$, can be used to access the link represented by Frontier($\rho$). $T(s,\rho)$ 
captures the transitive effect of backward transfers of $\rho$ through $s$. $T$ is defined as follows: 

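\[
T(s,\rho) =
\begin{cases}
\rho & s \text{ is a use statement} \\
\rho & s \text{ is } \alpha_x = \ldots \text{ and } \rho_x \text{ is not a prefix of } \rho \\
\mathcal{E} & s \text{ is } \alpha_x = \mathrm{New} \text{ and } \rho = \rho_x \rightarrow \sigma \\
\mathcal{E} & s \text{ is } \alpha_x = \mathrm{null} \text{ and } \rho = \rho_x \rightarrow \sigma \\
\rho_y \rightarrow \sigma & s \text{ is } \alpha_x = \alpha_y \text{ and } \rho = \rho_x \rightarrow \sigma \\
\rho & s \text{ is the function call } \alpha_x = f(\alpha_y) \text{ and Root}(\rho) \text{ is a global variable} \\
\rho_y \rightarrow \sigma & s \text{ is the function call } \alpha_x = f(\alpha_y),\ \rho = z \rightarrow \sigma \text{ and } z \text{ is the formal parameter of } f \\
\rho & s \text{ is the return statement } \mathrm{return}(\alpha_z) \text{ and Root}(\rho) \text{ is a global variable} \\
\rho_z \rightarrow \sigma & s \text{ is the return statement } \mathrm{return}(\alpha_z),\ \rho = \rho_x \rightarrow \sigma \text{ and the corresponding call is } \alpha_x = f(\alpha_y) \\
T(s_1, T(s_2, \rho)) & s \text{ is a sequence } s_1; s_2
\end{cases}
\]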
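The intraprocedural cases of T can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation used for the measurements: access paths are encoded as tuples of names, statements as tagged tuples, the invalid access path as `None`, and the function call and return cases are omitted. The rule that the invalid path stays invalid under further transfer is also an assumption made here so that sequences compose.

```python
EPS = None  # stands for the invalid access path of the definition above


def is_prefix(prefix, path):
    """True if `prefix` is a (not necessarily proper) prefix of `path`."""
    return path[:len(prefix)] == prefix


def T(stmt, rho):
    """Backward transfer of access path rho through stmt (no call/return cases).

    Statements (assumed encoding):
      ("use",)                  a use statement
      ("new", lhs)              lhs = New
      ("null", lhs)             lhs = null
      ("copy", lhs, rhs)        lhs = rhs
      ("seq", s1, s2)           s1; s2
    """
    if rho is EPS:                  # assumption: the invalid path stays invalid
        return EPS
    kind = stmt[0]
    if kind == "use":
        return rho
    if kind == "seq":
        _, s1, s2 = stmt
        return T(s1, T(s2, rho))    # transfer through s2 first, then s1
    lhs = stmt[1]                   # access path of the left-hand side
    if not is_prefix(lhs, rho):
        return rho                  # the assignment does not affect rho
    sigma = rho[len(lhs):]          # rho = lhs -> sigma
    if kind in ("new", "null"):
        return EPS                  # the link did not exist before the statement
    if kind == "copy":
        return stmt[2] + sigma      # rhs -> sigma
    raise ValueError(f"unknown statement kind: {kind}")


s1 = ("copy", ("y",), ("z",))       # y = z
s2 = ("copy", ("x",), ("y", "n"))   # x = y.n
before_s2 = T(s2, ("x", "r"))               # ("y", "n", "r")
before_both = T(("seq", s1, s2), ("x", "r"))  # ("z", "n", "r")
```

For the sequence `y = z; x = y.n`, the path x->r after the sequence corresponds to z->n->r before it, which is the composition that the last case of the definition prescribes.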

LEMMA 6.4. (Liveness Propagation). Let $\rho^a$ be in some explicit liveness graph at $p_a$. 
Let the sequence of statements between $p_b$ and $p_a$ be $s$. Then, if $T(s,\rho^a) = \rho^b$ and $\rho^b$ is not 
$\mathcal{E}$, then $\rho^b$ is in some explicit liveness graph at $p_b$. 

PROOF. The proof is by structural induction on $s$. Since $\rho^b$ is non-$\mathcal{E}$, the base cases are: 

(1) $s$ is a use statement. In this case $\rho^b = \rho^a$. 

(2) $s$ is an assignment $\alpha_x = \ldots$ such that $\rho_x$ is not a prefix of $\rho^a$. Here also $\rho^b = \rho^a$. 

(3) $s$ is an assignment $\alpha_x = \alpha_y$ such that $\rho^a = \rho_x \rightarrow \sigma$. In this case $\rho^b = \rho_y \rightarrow \sigma$. 

(4) $s$ is the function call $\alpha_x = f(\alpha_y)$. The only interesting case is when $\rho^a = z \rightarrow \sigma$, where 
$z$ is the formal parameter of $f$. In this case, $\rho^b = \rho_y \rightarrow \sigma$. 

(5) $s$ is the return statement return($\alpha_z$). The only interesting case is when $\rho^a = \rho_x \rightarrow \sigma$, and 
the corresponding call is $\alpha_x = f(\alpha_y)$. In this case, $\rho^b = \rho_z \rightarrow \sigma$. 

For (1) and (2), since $\rho^a$ is not in ELKillPath, $\rho^b$ is in some explicit liveness graph at $p_b$. 
For (3), from Equation (6), $\rho^b$ is in some explicit liveness graph at $p_b$. For (4) and (5), the 
result follows from the fact that Summary($\rho_y$) and Summary($\rho_z$) are in the explicit liveness 
graph of the program points before the call and return statements respectively. 

For the inductive step, assume that the lemma holds for $s_1$ and $s_2$. From the definition 
of $T$, there exists a non-$\mathcal{E}$ $\rho^i$ at the intermediate point $p_i$ between $s_1$ and $s_2$, such that 
$\rho^i = T(s_2,\rho^a)$ and $\rho^b = T(s_1,\rho^i)$. Since $\rho^a$ is in some explicit liveness graph at $p_a$, by 
the induction hypothesis, $\rho^i$ must be in some explicit liveness graph at $p_i$. Further, by the 
induction hypothesis, $\rho^b$ must be in some explicit liveness graph at $p_b$. □ 

6 When $s$ is a function call $\alpha_x = f(\alpha_y)$, $p_a$ is the entry point of $f$ and $p_b$ is the program point just before the 
statement $s$ in the caller's body. An analogous remark holds for the return statement. 

LEMMA 6.5. Every access path which is in some liveness graph at $p_b^m$ is also in some 
liveness graph at $p_b^o$. 

PROOF. If an extra explicitly live access path is introduced in the modified program, 
it could be only because of an inserted assignment $\alpha = \mathrm{null}$ at some $p_a^m$. The only access 
paths which this statement can add to an explicit liveness graph are the paths corresponding 
to the proper prefixes of $\alpha$. However, the algorithm selects $\alpha$ for nullification only if the 
access paths corresponding to all its proper prefixes are in some explicit liveness graph. 
Therefore every access path which is in some explicit liveness graph at $p_a^m$ is also in some 
explicit liveness graph at $p_a^o$. The same relation would hold at $p_b^m$ and $p_b^o$. 

If an extra live access path is introduced in the modified program, it could be only 
because of an inserted assignment $\alpha = \mathrm{null}$ at some $p_a^m$. The only access paths which this 
statement can add to any liveness graph are LnA($\rho'$, $AS^m$), where $\rho'$ is a proper prefix of 
$\rho$ and $AS^m$ represents the alias set at $p_a^m$. However, the algorithm selects $\alpha$ for 
nullification at $p_a^m$ only if the access paths corresponding to all its proper prefixes are in some 
liveness graph at $p_a^o$. As liveness graphs are closed under link aliasing, this implies that 
the liveness graph at $p_a^o$ includes the paths LnA($\rho'$, $AS^o$), where $AS^o$ represents the alias set at 
$p_a^o$. Since inserted statements can only kill aliases, $AS^m \subseteq AS^o$. Thus, LnA($\rho'$, $AS^m$), the 
paths resulting from insertion, are also in the liveness graph at $p_a^o$. Therefore every access 
path which is in some liveness graph at $p_a^m$ is also in some liveness graph at $p_a^o$. The same 
relation would hold at $p_b^m$ and $p_b^o$. □ 

THEOREM 6.1. (Safety of null insertion). Let the assignment $\alpha^b = \mathrm{null}$ be inserted by 
the algorithm immediately before $p_b^m$. Then: 

(1) Execution of $\alpha^b = \mathrm{null}$ does not raise any exception due to dereferencing. 

(2) Let $\alpha^a$ be used immediately after $p_a^m$ (in an original statement or an inserted null 
assignment). Then, execution of $\alpha^b = \mathrm{null}$ cannot nullify any link used in $\alpha^a$. 

PROOF. We prove the two parts separately. 

(1) If $\alpha^b$ is a root variable, then the execution of $\alpha^b = \mathrm{null}$ cannot raise an exception. 
When $\alpha^b$ is not a root variable, from the null assignment algorithm, every proper 
prefix $\rho'$ of $\rho^b$ is either anticipable or available. From the soundness of both these 
analyses, Target($\rho'$) exists and the execution of $\alpha^b = \mathrm{null}$ cannot raise an exception. 

(2) We prove this by contradiction. Let $s$ denote the sequence of statements between $p_b^m$ 
and $p_a^m$. Assume that $\alpha^b = \mathrm{null}$ nullifies a link used in $\alpha^a$. This is possible only if 
there exists a prefix $\rho'$ of $\rho^a$ such that $T(s,\rho')$ shares its frontier with $\rho^b$ at $p_b^m$. By 
Lemma 6.4, $T(s,\rho')$ must be in some explicit liveness graph at $p_b^m$. From Lemma 6.3 
and the definition of liveness, $\rho^b$ is in some liveness graph at $p_b^m$. By Lemma 6.5, $\rho^b$ is 
also in some liveness graph at $p_b^o$. Thus a decision to insert $\alpha^b = \mathrm{null}$ cannot be taken. □ 

Fig. 15. Temporal plots of memory usages. 

7. EMPIRICAL MEASUREMENTS 

In order to show the effectiveness of heap reference analysis, we have developed 
proof-of-concept implementations of heap reference analysis at two levels: one at the 
interprocedural level and the other at the intraprocedural level. 

7.1 Experimentation Methodology 

Our intraprocedural analyzer, which predates the interprocedural version, is evidence 
of the effectiveness of intraprocedural analysis. It was implemented using XSB-Prolog.⁷ 
The measurements were made on an 800 MHz Pentium III machine with 128 MB memory 
running Fedora Core release 2. The benchmarks used were Loop, DLoop, CReverse, 
BiSort, TreeAdd and GCBench. Three of these (Loop, DLoop and CReverse) are similar 
to those in [Shaham et al. 2003]. Loop creates a singly linked list and traverses it, DLoop 
is a doubly linked list variation of the same program, and CReverse reverses a singly linked 
list. BiSort and TreeAdd are taken from the Java version of the Olden benchmark suite [Carlisle 
1996]. GCBench is taken from [Boehm ]. 

For measurements on this implementation, the function of interest in a given Java 
program was manually translated to a Prolog representation. This allowed us to avoid redundant 
information such as temporaries and empty statements, resulting in a compact representation 
of programs. The interprocedural information for this function was approximated in the 
Prolog representations in the following manner: Calls to non-recursive functions were 
inlined and calls to recursive functions were replaced by iterative constructs which 
approximated the liveness property of heap manipulations in the function bodies. The result of 
the analysis was used to manually insert null assignments in the original Java programs to 
create modified Java programs. 

Manual interventions allowed us to handle procedure calls without performing interpro- 
cedural analysis. In order to automate the analysis and extend it to interprocedural level, 
we used SOOT [Vallee-Rai et al. 1999] which has built in support for many of our require- 
ments. However, compared to the Prolog representation of programs, the default Jimple 
representation used by SOOT is not efficient for our purposes because it introduces a large 
number of temporaries and contains all statements even if they do not affect heap reference 
analysis. 

As was described earlier, our interprocedural analysis is very simplistic. Our experi- 
ence shows that imprecision of interprocedural alias analysis increases the size of alias 
information thereby making the analysis inefficient apart from reducing the precision of 
the resulting information. This effect has been worsened by the fact that SOOT introduces 
a large number of temporary variables. Besides, the complete alias information is not 
required for our purposes. 

We believe that our approach can be made much more scalable by 

— Devising a method of avoiding full alias analysis and computing only the required alias 
information, and 

— Improving the Jimple representation by eliminating redundant information, combining 
multiple successive uses into a single statement etc. 

The implementations, along with the test programs (with their original, modified, and 
Prolog versions) are available at [Karkare 2005]. 

7.2 Measurements and Observations 
Our experiments were directed at measuring: 



7 Available from http://xsb.sourceforge.net. 

Fig. 16 panels (table entries not recoverable): Intraprocedural analysis of selected method (Prolog Implementation); 
Interprocedural analysis of all methods (SOOT Implementation). 

- #Iter is the maximum number of iterations taken by any analysis. 

- Analysis Time is the total time taken by all analyses. 

- #G is the total number of access graphs created by alias analysis and liveness analysis. The Prolog implementation 
also performs alias analysis using access graphs. 

- Max nodes (edges) is the maximum number of nodes (edges) over all access graphs. In some cases, the 
maximum number of nodes/edges is larger in intraprocedural analysis due to the presence of longer paths in 
explicitly supplied boundary information, which gets replaced by a single * node in interprocedural analysis. 

- Avg nodes (edges) is the average number of nodes (edges) over all access graphs. 

- #null is the number of inserted null assignments. 

Fig. 16. Empirical measurements of proof-of-concept implementations of heap reference analyzer. 



(1) The efficiency of analysis. We measured the total time required, number of iterations 
of round robin analyses, and the number and sizes of access graphs. 

(2) The effectiveness of null assignment insertions. The programs were made to create 
huge data structures. Memory usage was measured by explicit calls to the garbage 
collector in both modified and original Java programs at specific probing points such as 
call sites, call returns, loop begins and loop ends. The overall execution time for the 
original and the modified programs was also measured. 

The results of our experiments are shown in Figure 15 and Figure 16. As can be seen 
from Figure 15, nullification of links helped the garbage collector to collect a lot more 
garbage, thereby reducing the allocated heap memory. In case of BiSort, however, the links 
were last used within a recursive procedure which was called multiple times. Hence, safety 
criteria prevented null assignment insertion within the called procedure. Our analysis could 
only nullify the root of the data structure at the end of the program. Thus the memory was 
released only at the end of the program. 
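The effect of nullifying a dead link can be reproduced in miniature. The sketch below is a Python analogue, not one of the Java benchmarks: the class `Node`, the helper `build_list`, and the use of a weak reference as a probe are all choices made here for illustration. It checks whether the tail of a list is still held in memory before and after the inserted assignment.

```python
import gc
import weakref


class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


def build_list(n):
    """Build a singly linked list of n nodes and return its head."""
    head = Node(0)
    cur = head
    for i in range(1, n):
        cur.next = Node(i)
        cur = cur.next
    return head


def demo():
    head = build_list(100)
    # A weak reference observes the second node without keeping it alive.
    probe = weakref.ref(head.next)
    gc.collect()
    # The tail is never used again, yet it is still reachable through head.next.
    alive_before = probe() is not None
    head.next = None   # analogue of an inserted null assignment
    gc.collect()
    alive_after = probe() is not None
    return head.value, alive_before, alive_after
```

On CPython, `demo()` reports the tail as held before the assignment and reclaimed after it, even though `head` itself remains in use. This mirrors the distinction between reachable and live objects that the measurements in Figure 15 rely on.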

For interprocedural analysis, class files for both original as well as modified programs 
were generated using SOOT. As can be seen from Figure 16, modified programs executed 
faster. In general, a reduction in execution time can be attributed to the following two 
factors: (a) a decrease in the number of calls to the garbage collector and (b) a reduction in the 
time taken for garbage collection in each call. The former is possible because of the availability 
of a larger amount of free memory; the latter is possible because less reachable memory 
needs to be copied.⁸ In our experiments, factor (a) above was absent because the number of 
(explicit) calls to the garbage collector was kept the same. GCBench showed a large improvement 
in execution time after null assignment insertion. This is because GCBench creates large 
trees in the heap, which are not used in the program after creation, and our implementation was 
able to nullify the left and right subtrees of these trees immediately after their creation. This 
also reduced the high water mark of the heap memory requirement. 

As explained in Section 5.2, the sizes of the access graphs (average number of nodes and 
edges) are small. This can be verified from Figure 16. The analysis of DLoop creates a large 
number of access graphs because of the presence of cycles in the heap. In such a case, a large 
number of alias pairs are generated, many of which are redundant. Though it is possible 
to reduce analysis time by eliminating redundant alias pairs, our implementation, being a 
proof-of-concept implementation, does not do so for the sake of simplicity. 

Our technique and implementation compare well with the technique and results 
described in [Shaham et al. 2003]. A conceptual comparison with this method is included in 
Section 9.2. The implementation described in [Shaham et al. 2003] runs on a 900 MHz 
P-III with 512 MB RAM running Windows 2000. It takes 1.76 seconds, 2.68 seconds and 
4.79 seconds respectively for Loop, DLoop and CReverse for null assignment insertion. 
The time required by our implementation for the above mentioned programs is given in 
Figure 16. Our implementation automatically computes the program points for null insertion 
whereas their method cannot do so. Our implementation performs much better in all cases. 

8. EXTENSIONS FOR C++ 

This approach becomes applicable to C++ by extending the concept of access graphs to 
faithfully represent the C++ memory model. It is assumed that the memory which be- 
comes unreachable due to nullification of pointers is reclaimed by an independent garbage 
collector. Otherwise, explicit reclamation of memory can be performed by checking that 
no node-alias of a nullified pointer is live. 

In order to extend the concept of access graphs to C++, we need to account for two 
major differences between the C++ and the Java memory model: 

(1) Unlike Java, C++ has explicit pointers. A field of a structure (struct or class) can be 
accessed in two different ways in C++: 

— using pointer dereferencing (*.), e.g. (*x).lptr 9 or 
— using simple dereferencing (.) , e.g. y.rptr. 
We need to distinguish between the two. 

(2) Although root variables are allocated on the stack in both C++ and Java, C++ allows a 
pointer/reference to point to root variables on the stack through the use of the address-of (&) 
operator, whereas Java does not allow a reference to point to the stack. Since the root 
nodes in access graphs do not have an incoming edge by definition, it is not possible 
to use access graphs directly to represent memory links in C++. 



8 This happens because the Java Virtual Machine uses a copying garbage collector. 
9 This is equivalently written as x->lptr. 

We create access graphs for the C++ memory model as follows: 

(1) We treat dereference of a pointer as a field reference, i.e., * is considered as a field 
named deref. For example, an access expression (*x).lptr is viewed as x.deref.lptr, 
and the corresponding access path is x->deref->lptr. The access path for x.lptr is x->lptr. 

(2) Though a pointer can point to a variable x, it is not possible to extract the address of &x, 
i.e. no pointer can point to &x. For Java, we partitioned memory into stack and heap, and 
had root variables of access graphs correspond to stack variables. In C++, we partition 
the memory into addresses of variables and the rest of the memory (stack and heap together). 
We make the roots of access graphs correspond to addresses of variables. A root 
variable y is represented as deref(&y). Thus, the access graph &y->deref->l represents access 
paths &y, &y->deref and &y->deref->l, which correspond to access expressions 
&y, y and y.l respectively. 
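The two rules above can be read as a small translation from access expressions to access paths. The following sketch assumes a nested-tuple encoding of expressions and the function names `access_path` and `show`; both are hypothetical and chosen here only for illustration. Rule (1) alone is applied when `address_roots` is false, and rule (2) is added when it is true.

```python
def access_path(expr, address_roots=False):
    """Return the access path of a C++ access expression as a list of labels.

    expr is a nested tuple (an assumed encoding):
      ("var", "x")          the variable x
      ("addr", "x")         &x
      ("deref", e)          *e
      ("field", e, "f")     e.f
    """
    kind = expr[0]
    if kind == "var":
        # Rule (2): with address roots, a variable y is deref(&y).
        return ["&" + expr[1], "deref"] if address_roots else [expr[1]]
    if kind == "addr":
        if not address_roots:
            raise ValueError("&x can only be represented with address roots")
        return ["&" + expr[1]]
    if kind == "deref":
        # Rule (1): * is treated as a field named deref.
        return access_path(expr[1], address_roots) + ["deref"]
    if kind == "field":
        return access_path(expr[1], address_roots) + [expr[2]]
    raise ValueError(f"unknown expression kind: {kind}")


def show(expr, address_roots=False):
    """Render an access path in the x->f->g notation used in the text."""
    return "->".join(access_path(expr, address_roots))


star_x_lptr = show(("field", ("deref", ("var", "x")), "lptr"))  # (*x).lptr
y_dot_l = show(("field", ("var", "y"), "l"), address_roots=True)  # y.l
```

With rule (1) only, `(*x).lptr` becomes x->deref->lptr; with address roots, `y.l` becomes &y->deref->l, matching the paths listed in the text.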

Handling pointer arithmetic and type casting in C++ is orthogonal to the above discussion, 
and requires techniques similar to those of [Yong et al. 1999; Cheng and Hwu 2000]. 



9. RELATED WORK 

Several properties of the heap (viz. reachability, sharing, liveness etc.) have been explored in 
the past; a good review has been provided by Sagiv et al. [2002]. In this section, we review 
the related work on the main property of interest: liveness. We are not aware of past work 
on availability and anticipability analysis of heap references. 

9.1 Liveness Analysis of Heap Data 

Most of the reported literature on liveness analysis of heap data either does not address 
liveness of individual objects or addresses liveness of objects identified by their allocation 
sites. Our method, by contrast, does not need knowledge of the allocation site. Since the 
precision of a garbage collector depends on its ability to distinguish between reachable 
heap objects and live heap objects, even state-of-the-art garbage collectors leave a significant 
amount of garbage uncollected [Agesen et al. 1998; Shaham et al. 2000; 2001; 2002]. All 
reported attempts to incorporate liveness in garbage collection have been quite 
approximate. The known approaches have been: 

(1) Liveness of root variables. A popular approach (which has also been used in some 
state-of-the-art garbage collectors) involves identifying liveness of root variables on the stack. 
All heap objects reachable from the live root variables are considered live [Agesen 
et al. 1998]. 

(2) Imposing stack discipline on heap objects. These approaches try to change the stati- 
cally unpredictable lifetimes of heap objects into predictable lifetimes similar to stack 
data. They can be further classified as 

— Allocating objects on call stack. These approaches try to detect which objects can be 
allocated on the stack frame so that they are automatically deallocated without the need 
for traditional garbage collection. A profile based approach which tracks the last use 
of an object is reported in [McDowell 1998], while a static analysis based approach 
is reported in [Reid et al. 1999]. 

Some approaches ask a converse question: which objects are unstackable (i.e. their 
lifetimes outlive the procedure which created them)? They use abstract interpretation 
and perform escape analysis to discover objects which escape a procedure [Blanchet 
1999; 2003; Choi et al. 1999]. All other objects are allocated on stack. 

— Associating objects with call stack [Cannarozzi et al. 2000]. This approach identifies 
the stackability. The objects are allocated in the heap but are associated with a stack 
frame and the runtime support is modified to deallocate these (heap) objects when 
the associated stack frame is popped. 

— Allocating objects on separate stack. This approach uses a static analysis called re- 
gion inference [Tofte and Birkedal 1998; Hallenberg et al. 2002] to identify regions 
which are storages for objects. These regions are allocated on a separate region 
stack. 

All these approaches require modifying the runtime support for the programs. 

(3) Liveness analysis of locally allocated objects. The Free-Me approach [Guyer et al. 
2006] combines a lightweight pointer analysis with liveness information that detects 
when allocated objects die and inserts statements to free such objects. The analysis is 
simpler and cheaper as the scope is limited, but it frees locally allocated objects only 
by separating objects which escape the procedure call from those which do not. The 
objects which do not escape the procedure which creates them become unreachable at 
the end of the procedure anyway and would be garbage collected. Thus their method 
merely advances the work of garbage collection instead of creating new garbage. 
Further, this does not happen in the called method. Finally, their method uses traditional 
liveness analysis for root variables only and hence cannot free objects that are stored 
in field references. 

(4) Shape analysis based approaches. The two approaches in this category are 
— The Heap Safety Automaton approach [Shaham et al. 2003], a recently reported work 
which comes closest to our approach since it tries to determine if a reference can be 
made null. We discuss this approach in the next section. 
— Cherem and Rugina [2006] use a shape analysis framework [Hackett and Rugina 
2005] to analyze a single heap cell to discover the point in the program where it 
becomes unreachable. Their method reclaims the objects at such points thereby 
reducing the work of the garbage collector. They use equivalence classes of 
expressions to store definite points-to and definitely-not points-to information in order to 
increase the precision of abstract reference counts. However, multiple iterations of 
the analysis and the optimization steps are required, since freeing a cell might result 
in opportunities for more deallocations. Their method does not take into account the 
last use of an object, and therefore does not make additional objects unreachable. 

9.2 Heap Safety Automaton Based Approach 

This approach models safety of inserting a null statement at a given point by an automaton. 
A shape graph based abstraction of the program is then model-checked against the heap 
safety automaton. Additionally, they also consider freeing the object; our approach can be 
easily extended to include freeing. 

The fundamental differences between the two approaches are 

— Their method answers the following question: Given an access expression and a program 
point, can the access expression be set to null immediately after that program point? 
However, they leave a very important question unanswered: Which access expressions 
should we consider and at which point in the program? It is impractical to use their 
method to ask this question for every pair of access expression and program point. Our 
method answers both the questions by finding out appropriate access expressions and 
program points. 

— We insert null assignments at the earliest possible point. The effectiveness of any method 
to improve garbage collection depends crucially on this aspect. Their method does not 
address this issue directly. 

— As noted in Section 7.2, their method is inefficient in practice. For a simple Java program 
containing 11 lines of executable statements, it takes over 1.37 MB of storage and takes 
1.76 seconds for answering the question: Can the variable y be set to null after line 10? 

Hence our approach is superior to their approach in terms of completeness, effectiveness, 
and efficiency. 

10. CONCLUSIONS AND FURTHER WORK 

Two fundamental challenges in analyzing heap data are that the temporal and spatial struc- 
tures of heap data seem arbitrary and are unbounded. The apparent arbitrariness arises due 
to the fact that the mapping between access expressions and l-values varies dynamically. 

The two key insights which allow us to overcome the above problems in the context of 
liveness analysis of heap data are: 

— Creating finite representations for properties of heap data using program structure. We 
create an abstract representation of heap in terms of sets of access paths. Further, a 
bounded representation, called access graphs, is used for summarizing sets of access 
paths. Summarization is based on the fact that the heap can be viewed as consisting 
of repeating patterns which bear a close resemblance to the program structure. Access 
graphs capture this fact directly by tagging program points to access graph nodes. Un- 
like [Horwitz et al. 1989; Chase et al. 1990; Choi et al. 1993; Wilson and Lam 1995; 
Hind et al. 1999] where only memory allocation points are remembered, we remember 
all program points where references are used. This allows us to combine data flow in- 
formation arising out of the same program point, resulting in bounded representations 
of heap data. These representations are simple, precise, and natural. 
The dynamically varying mapping between access expressions and l-values is handled 
by abstracting out regions in the heap which can possibly be accessed by a program. 
These regions are represented by sets of access paths and access graphs which are ma- 
nipulated using a carefully chosen set of operations. The computation of access graphs 
and access paths using data flow analysis is possible because of their finiteness and 
the monotonicity of the chosen operations. We define data flow analyses for liveness, 
availability and anticipability of heap references. Liveness analysis is an any path 
problem, hence it involves unbounded information requiring access graphs as data flow 
values. Availability and anticipability analyses are all paths problems, hence they involve 
bounded information which is represented by finite sets of access paths. 

— Identifying the minimal information which covers every live link in the heap. An inter- 
esting aspect of our liveness analysis is that the property of explicit liveness captures 
the minimal information which covers every link which can possibly be live. Complete 
liveness is computed by incorporating alias information in explicit liveness. 
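The difference between any path and all paths problems shows up in how information from several control flow paths is combined. The sketch below uses plain sets of access path strings for both kinds of analysis; this is a simplification made here for illustration, since liveness in this paper is represented by access graphs rather than sets. The function names are hypothetical.

```python
def merge_any_path(incoming):
    """Any path confluence: a fact holds if it holds along some path (union)."""
    return set().union(*incoming)


def merge_all_paths(incoming):
    """All paths confluence: a fact holds only if it holds along every path."""
    facts = list(incoming)
    return set.intersection(*facts) if facts else set()


# Facts arriving along two hypothetical control flow paths.
left = {"x", "x->n"}
right = {"x", "x->r"}

any_path = merge_any_path([left, right])    # as used by liveness
all_paths = merge_all_paths([left, right])  # as used by availability/anticipability
```

Union can only grow as more paths are merged, which is why liveness needs a summarizing representation, whereas intersection can never exceed the smallest incoming set and so stays bounded.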

An immediate application of these analyses is a technique to improve garbage collection. 
This technique works by identifying objects which are dead and rendering them 
unreachable by setting them to null as early as possible. Though this idea was previously known to 
yield benefits [Gadbois et al. ], nullification of dead objects was based on profiling 
[Shaham et al. 2001; 2002]. Our method, instead, is based on static analysis. 

For future work, we see scope for improvement both at the conceptual level and 
at the level of implementation. 

(1) Conceptual Aspects. 

(a) Since the scalability of our method critically depends on the scalability of alias 
analysis, we would like to explore the possibility of avoiding computation of com- 
plete alias information at each program point. Since explicit liveness does not 
require alias information, an interesting question for further investigation is: Just 
how much alias information is enough to compute complete liveness from explicit 
liveness? This question is important because: 

— Not all aliases contribute to complete liveness. 

— Even when an alias contributes to liveness, it needs to be propagated over a 
limited region of the program. 

(b) We have proposed an efficient version of call strings based interprocedural data 
flow analysis in an independent work [Karkare 2007]. It is a generic approach 
which retains full context sensitivity. We would like to use it for heap reference 
analysis. 

(c) We would like to improve the null insertion algorithm so that the same link is not 
nullified more than once. 

(d) We would like to analyze array fragments instead of treating an entire array as a 
scalar (and hence, all elements as equivalent). 

(e) We would also like to extend the scope of heap reference analysis for functional 
languages. The basic method and the details of the liveness analysis are already 
finalized [Karkare et al. 2007]. The details of other analyses are being final- 
ized [Karkare et al. 2007]. 

(2) Implementation Related Aspects. 

(a) We would also like to implement this approach for C/C++ and use it for plugging 
memory leaks statically. 

(b) Our experience with our proof-of-concept implementations indicates that the en- 
gineering choices made in the implementation have a significant bearing on the 
performance of our method. For example, we would like to use a better represen- 
tation than the one provided by SOOT. 

We would also like to apply the summarization heuristic to other analyses. Our initial 
explorations indicate that a similar approach would be useful for extending static infer- 
encing of flow-sensitive monomorphic types [Khedker et al. 2003] to include polymorphic 
types. This is possible because polymorphic types represent an infinite set of types and 
hence discovering them requires summarizing unbounded information. 

Fig. 17. Non-distributivity of liveness analysis. Access path x->r->n->r is a spurious access path which does not 
get killed by the assignment x.n = null in block 1. 

A. NON-DISTRIBUTIVITY IN HEAP REFERENCE ANALYSIS 

Explicit liveness analysis defined in this paper is not distributive whereas availability and 
anticipability analyses are distributive. Explicit liveness analysis is non-distributive 
because of the approximation introduced by the $\uplus$ operation. $G_1 \uplus G_2$ may contain access 
paths which are neither in $G_1$ nor in $G_2$. 

Example A.1. Figure 17 illustrates the non-distributivity of explicit liveness analysis. 
Liveness graphs associated with the entry of each block are shown in shaded boxes. Let $f_1$ 
denote the flow function which computes x-rooted liveness graphs at the entry of block 1. 
Neither $ELIn_x(2)$ nor $ELIn_x(4)$ contains the access path x->r->n->r but their union contains 
it. It is easy to see that 

\[
f_1(ELIn_x(2) \uplus ELIn_x(4)) \sqsubset_G f_1(ELIn_x(2)) \uplus f_1(ELIn_x(4))
\]

□ 
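The phenomenon in Example A.1 can be reproduced with two small hypothetical graphs. The graphs below are not those of Figure 17, whose contents are not reproduced here; they are chosen so that the same spurious path x->r->n->r appears. The sketch assumes that an access graph is a set of edges, that a node label such as `r1` stands for field r at program point 1, that graph union is edge-set union, and that the flow function for `x.n = null` removes the edges from the root to nodes for field n and then discards whatever is no longer reachable.

```python
def reachable_edges(edges, root="x"):
    """Drop edges whose source can no longer be reached from the root."""
    seen = {root}
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            if src in seen and dst not in seen:
                seen.add(dst)
                changed = True
    return {(src, dst) for src, dst in edges if src in seen}


def kill_root_field(edges, field, root="x"):
    """Assumed effect of `root.field = null`: paths through root->field die."""
    kept = {(s, d) for s, d in edges if not (s == root and d[0] == field)}
    return reachable_edges(kept, root)


def paths(edges, root="x", max_len=4):
    """Enumerate access paths with at most max_len field dereferences."""
    found = set()
    stack = [(root, (root,))]
    while stack:
        node, labels = stack.pop()
        if len(labels) > max_len:
            continue
        for src, dst in edges:
            if src == node:
                longer = labels + (dst[0],)  # first character is the field name
                found.add("->".join(longer))
                stack.append((dst, longer))
    return found


G1 = {("x", "r1"), ("r1", "n2")}   # paths x->r, x->r->n
G2 = {("x", "n2"), ("n2", "r3")}   # paths x->n, x->n->r

merged_then_killed = kill_root_field(G1 | G2, "n")
killed_then_merged = kill_root_field(G1, "n") | kill_root_field(G2, "n")
```

Merging first lets the shared node `n2` connect x->r->n to n->r, so x->r->n->r survives the kill; applying the flow function to each graph separately removes n->r before the merge. The two results differ, which is what non-distributivity means.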

Availability and anticipability analyses are non-distributive because they depend on 
may-alias analysis which is non-distributive. 